D3.2a: An Afaan Oromo corpus, sized 3 million words

Building the Afaan Oromo Web corpus

We created the Oromo Web corpus in the following steps. First, following the Corpus Factory method [1], bigrams of Oromo words from the Crúbadán database [2] were used to query the Bing search engine for documents in Oromo. The URLs of the 9,847 documents returned by the search engine then served as starting points for the SpiderLing web crawler [3].

The following language models and resources were created:

A character trigram model for language detection, trained on 372 KB of manually checked text from documents found by the search engine.
A byte trigram model for character encoding detection, trained on web pages obtained by the Corpus Factory method.
A wordlist of the 253 most frequent Oromo words, taken from the manually checked word bigrams from the Crúbadán database, used by the boilerplate removal tool jusText [4] to check the language of running text.
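
To illustrate how a character trigram model can identify a language, here is a minimal sketch. It is not the model used for orWaC16: the function names, the add-one smoothing, and the tiny training strings are assumptions made for the example, and a real model would be trained on far more text (372 KB in this case).

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the overlapping character trigrams of a lower-cased text."""
    # Normalise whitespace and pad with spaces so word edges are captured.
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(text):
    """Build a trigram count profile from training text."""
    return Counter(char_trigrams(text))


def score(profile, text):
    """Average log-probability per trigram, with add-one smoothing (an assumption)."""
    grams = char_trigrams(text)
    if not grams:
        return float("-inf")
    total = sum(profile.values())
    vocab = len(profile) + 1  # +1 reserves probability mass for unseen trigrams
    return sum(
        math.log((profile[g] + 1) / (total + vocab)) for g in grams
    ) / len(grams)


def detect(profiles, text):
    """Return the name of the profile that scores the text highest."""
    return max(profiles, key=lambda name: score(profiles[name], text))


# Toy training data (hypothetical): frequent Oromo words versus an English sentence.
profiles = {
    "oromo": train("akka kan hin fi Oromoo kana tokko itti waan yeroo "
                   "keessatti isa isaa irratti jiru"),
    "english": train("the quick brown fox jumps over the lazy dog and "
                     "this is the way of the world"),
}

print(detect(profiles, "yeroo tokko keessatti"))
print(detect(profiles, "the way of the dog"))
```

The same scoring idea carries over to byte trigrams for encoding detection: each candidate encoding gets a profile, and the raw bytes are scored against each one.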

The crawler was set to harvest web domains in the national top-level domains of Ethiopia, Eritrea, Somalia, and Djibouti (et, er, so, dj) and in general TLDs (com, org, net, gov, info, edu). 42 GB of HTTP responses were gathered in the process. jusText removed HTML tags and boilerplate paragraphs from the raw data. Oromo texts were then separated out using the character trigram language model.
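
The TLD restriction can be pictured as a simple scope check on each candidate URL. The sketch below is an illustration only: it does not reproduce SpiderLing's configuration, and the names `ALLOWED_TLDS` and `in_crawl_scope` are hypothetical.

```python
from urllib.parse import urlsplit

# TLDs listed in the text above.
NATIONAL_TLDS = {"et", "er", "so", "dj"}
GENERAL_TLDS = {"com", "org", "net", "gov", "info", "edu"}
ALLOWED_TLDS = NATIONAL_TLDS | GENERAL_TLDS


def in_crawl_scope(url):
    """Return True if the URL's host ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname  # already lower-cased; None if no host is present
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in ALLOWED_TLDS


print(in_crawl_scope("http://qeerroo.org/news"))
print(in_crawl_scope("https://www.example.de/"))
```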

Duplicate and near-duplicate paragraphs were identified and removed using the onion tool [4]. The final corpus contains 32 MB of text and 5.1 million tokens. It is called orWaC16 (Oromo 'Web as Corpus' corpus, year 2016).
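
Near-duplicate removal at paragraph level can be sketched with token n-grams: a paragraph is dropped when too many of its n-grams have already been seen in retained text. This is a simplified illustration of the general idea, not onion's implementation; the n-gram length of 5 and the threshold of 0.5 are assumptions chosen for the example, not the settings used for orWaC16.

```python
def shingles(tokens, n):
    """Return the set of overlapping token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is at most `threshold`."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        tokens = paragraph.split()
        grams = shingles(tokens, n)
        if not grams:
            # Paragraph shorter than n tokens: compare it as a single unit.
            grams = {tuple(tokens)}
        duplicated = sum(1 for g in grams if g in seen) / len(grams)
        if duplicated <= threshold:
            kept.append(paragraph)
            seen.update(grams)  # only retained text contributes n-grams
    return kept


sample = [
    "a b c d e f g",   # kept: nothing seen yet
    "a b c d e f g",   # exact duplicate
    "a b c d e f h",   # near duplicate: 2 of its 3 five-grams already seen
    "x y z w v u t",   # kept: all n-grams are new
]
print(deduplicate(sample))
```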

Corpus properties

Basic properties of the corpus and its sources are summarised below.

The size of corpus structures:

Document count	8,851
Paragraph count	76,115
Sentence count	250,432
Token count	5,091,696
Latin script lexicon size	273,056
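
A few averages follow directly from these counts, for example roughly 575 tokens per document and about 20 tokens per sentence. The short calculation below derives them; the variable names are chosen for the example.

```python
# Counts taken from the table above.
counts = {
    "documents": 8_851,
    "paragraphs": 76_115,
    "sentences": 250_432,
    "tokens": 5_091_696,
}

tokens_per_document = counts["tokens"] / counts["documents"]
tokens_per_sentence = counts["tokens"] / counts["sentences"]
paragraphs_per_document = counts["paragraphs"] / counts["documents"]
sentences_per_paragraph = counts["sentences"] / counts["paragraphs"]

print(f"tokens per document:     {tokens_per_document:.1f}")
print(f"tokens per sentence:     {tokens_per_sentence:.1f}")
print(f"paragraphs per document: {paragraphs_per_document:.1f}")
print(f"sentences per paragraph: {sentences_per_paragraph:.1f}")
```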

Document count – the most frequent web domains and domain size distribution:

Top level domains		Web domains		Second level domain size distribution
org	5,676	*.jw.org	2,695	At least 1000 documents	2
com	2,054	qeerroo.org	1,010	At least 500 documents	4
net	839	vidoser.org	632	At least 100 documents	16
et	213	gadaa.net|com	518	At least 50 documents	21
		*.voaafaanoromoo.com	438	At least 10 documents	45
		oromedia.net	304	At least 5 documents	60
		bilisummaa.com	291	At least 1 document	190
		*.blogspot.com	287
		oromiatimes.org	276

Content from news, politics, and religious sites makes up a significant share of the corpus sources.
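
The size distribution in the table is cumulative: each row counts the second-level domains with at least the given number of documents, so the two domains with at least 1000 documents (*.jw.org and qeerroo.org) are also included in every row below. The following sketch converts the cumulative figures into disjoint size ranges; the function name `to_buckets` is hypothetical.

```python
# (minimum documents, number of domains with at least that many), from the table above.
at_least = [(1000, 2), (500, 4), (100, 16), (50, 21), (10, 45), (5, 60), (1, 190)]


def to_buckets(cumulative):
    """Turn 'at least N' counts into (lower, upper, count) ranges; upper=None means open-ended."""
    buckets = []
    previous = 0
    upper = None
    for lower, count in cumulative:
        buckets.append((lower, upper, count - previous))
        previous = count
        upper = lower - 1
    return buckets


for lower, upper, count in to_buckets(at_least):
    label = f"{lower}+" if upper is None else f"{lower}-{upper}"
    print(f"{label:>8} documents: {count} domains")
```

Read this way, 130 of the 190 domains contribute fewer than 5 documents each.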

The most frequent words:

Word (Latin script)	Count
akka	82,032
kan	71,775
hin	65,390
fi	64,710
Oromoo	32,189
kana	26,818
tokko	26,699
itti	25,926
waan	24,733
yeroo	22,580
keessatti	22,189
isa	21,732
isaa	21,636
irratti	20,666
jiru	20,655
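
Raw counts are easier to compare across corpora when normalised to occurrences per million tokens. A minimal sketch, using the token count reported above and a hypothetical helper `per_million`:

```python
TOKEN_COUNT = 5_091_696  # corpus size in tokens, from the corpus properties table

# First five rows of the frequency list above.
counts = {"akka": 82_032, "kan": 71_775, "hin": 65_390, "fi": 64_710, "Oromoo": 32_189}


def per_million(count, corpus_size=TOKEN_COUNT):
    """Convert a raw count into a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


for word, count in counts.items():
    print(f"{word}\t{count}\t{per_million(count):.0f} per million")
```

For instance, akka occurs roughly 16,000 times per million tokens, which is about 1.6% of all tokens.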

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [5]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16.

References

[1] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
[2] -- Scannell, Kevin P. "The Crúbadán Project: Corpus building for under-resourced languages." In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, vol. 4, pp. 5-15. 2007.
[3] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
[4] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Doctoral dissertation, Masaryk University, Faculty of Informatics (2011).
[5] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.

Last modified on Jan 17, 2017, 12:42:34 PM