D3.4a: A Somali corpus, sized 10 million words

Building the Somali Web corpus

The Somali Web corpus was built in the following steps. First, following the Corpus Factory method [1], bigrams of Somali words from the Crúbadán database [2] were used to query the Bing search engine for documents in Somali. The URLs of the 18,108 documents found by the search engine served as starting points for the web crawler SpiderLing [3].
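The seeding step can be sketched as follows. This is a minimal illustration, not the code actually used: the function names are hypothetical, and `search` stands in for whatever client queries the search engine (no real Bing API call is shown).

```python
from typing import Callable, Iterable

def bigram_queries(bigrams: Iterable[tuple[str, str]]) -> list[str]:
    """Turn word bigrams into search query strings, dropping repeats."""
    seen = set()
    queries = []
    for first, second in bigrams:
        query = f"{first.lower()} {second.lower()}"
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries

def collect_seed_urls(queries: Iterable[str],
                      search: Callable[[str], list[str]]) -> list[str]:
    """Run every query through `search` (a hypothetical callable that
    returns result URLs) and keep each URL once, in first-seen order."""
    seen = set()
    seeds = []
    for query in queries:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds
```

The resulting list plays the role of the 18,108 seed URLs handed to the crawler.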

The following language models and resources were created:

Character trigram model for language detection. The model was trained on 292 KB of manually checked text from documents found by the search engine.
Byte trigram model for character encoding detection. The model was trained on web pages obtained by the Corpus Factory method.
Wordlist for language checking. The 304 most frequent Somali words from the manually checked Crúbadán word bigrams were used by the boilerplate removal tool jusText [4] to check the language of running text.
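To illustrate how a character trigram model can separate languages, here is a minimal sketch. It is not the model used for the corpus: the add-one smoothing, the padding scheme, and the class and function names are all assumptions made for the example.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Return all overlapping character trigrams of the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TrigramModel:
    """Character trigram counts with add-one smoothing (an assumed choice)."""

    def __init__(self, training_text):
        self.counts = Counter(char_trigrams(training_text))
        self.total = sum(self.counts.values())
        # One extra slot so that unseen trigrams get non-zero probability.
        self.vocab = len(self.counts) + 1

    def score(self, text):
        """Average log-probability per trigram; higher means a better fit."""
        grams = char_trigrams(text)
        logp = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return logp / len(grams)

def detect(text, models):
    """Return the name of the model that scores the text highest."""
    return max(models, key=lambda name: models[name].score(text))

# Toy training data; the real model was trained on 292 KB of checked text.
models = {
    "somali": TrigramModel("oo ka ay ku iyo ee ah u in ayaa uu soo la lagu ugu"),
    "english": TrigramModel("the quick brown fox jumps over the lazy dog and then runs away"),
}
```

With these toy models, `detect("ayaa lagu soo", models)` picks the Somali model and `detect("the lazy dog", models)` picks the English one. A production model would need far more training text per language.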

The crawler was set to harvest web domains in the national top-level domains of Ethiopia, Eritrea, Somalia and Djibouti (et, er, so, dj) and in general TLDs (com, org, net, gov, info, edu). In the process, 42 GB of HTTP responses were gathered. HTML tags and boilerplate paragraphs were removed from the raw data by jusText. Somali texts were then separated out using the character trigram language model.
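The TLD restriction amounts to a simple scope check on each URL. The sketch below shows one way to express it. It does not reflect SpiderLing's actual configuration, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# TLDs listed in the text: four national ones plus six general ones.
ALLOWED_TLDS = {"et", "er", "so", "dj", "com", "org", "net", "gov", "info", "edu"}

def in_crawl_scope(url):
    """Return True if the URL's host ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in ALLOWED_TLDS
```

For example, `in_crawl_scope("http://risaala.net/news")` is true, while a URL under a TLD outside the list, such as `https://example.de/`, is rejected.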

Duplicate and near-duplicate paragraphs were identified and removed using the onion tool [4]. The final corpus comprises 461 MB of text and 80 million tokens. The corpus is called soWaC16 (Somali 'Web as Corpus' corpus, year 2016).
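The idea behind near-duplicate removal can be illustrated with token n-gram overlap: a paragraph is dropped when too many of its n-grams have already been seen. The sketch below is a simplified stand-in, not onion's implementation. The n-gram length, the threshold, and the function name are illustrative assumptions.

```python
def dedupe_paragraphs(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph only if at most `threshold` of its token n-grams
    already occurred in previously kept paragraphs."""
    seen = set()
    kept = []
    for para in paragraphs:
        tokens = para.lower().split()
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        if not grams:
            # Paragraph shorter than n tokens: treat it as a single n-gram.
            grams = {tuple(tokens)}
        dup_ratio = sum(g in seen for g in grams) / len(grams)
        if dup_ratio <= threshold:
            kept.append(para)
            seen.update(grams)
    return kept
```

With the defaults, an exact repeat of an earlier paragraph has a duplicate ratio of 1.0 and is removed, and so is a paragraph that differs from an earlier one only in its final word.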

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count	385,338
Paragraph count	1,937,758
Sentence count	2,643,336
Token count	79,741,231
Latin script lexicon size	1,399,350
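The counts above also give a rough picture of the average document shape. The short sketch below derives a few ratios from the reported figures. Only the four counts come from the table; the dictionary and function names are illustrative.

```python
# Counts copied from the table of corpus structure sizes.
CORPUS_SIZES = {
    "documents": 385_338,
    "paragraphs": 1_937_758,
    "sentences": 2_643_336,
    "tokens": 79_741_231,
}

def ratio(numerator, denominator):
    """Average number of `numerator` units per `denominator` unit."""
    return CORPUS_SIZES[numerator] / CORPUS_SIZES[denominator]

for num, den in [("tokens", "documents"),
                 ("paragraphs", "documents"),
                 ("tokens", "sentences")]:
    print(f"{num} per {den[:-1]}: {ratio(num, den):.2f}")
```

This works out to roughly 207 tokens and 5 paragraphs per document, and about 30 tokens per sentence.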

Document count – the most frequent web domains and domain size distribution:

Top level domains:

Top level domain	Documents
net	295,358
org	75,860
com	7,397
info	4,577
so	1,930

Web domains:

Web domain	Documents
risaala.net	22,823
goolfm.net	22,544
vidinfo.org	21,904
batalaalenews.net	17,079
keydmedia.net	15,453
alshahid.net	13,923
daadmadheedhnews.net	13,693
somaliland.org	13,203
vidoser.org	12,189
somalilandpost.net	10,853
radiodanan.net	10,196
geeska.net	8,378
camuudnews.net	8,218
nogob.net	7,045
allsomali24.org	6,755
sagalradio.org	6,154
qarninews.net	6,097

Second level domain size distribution:

Domain size	Domains
At least 1000 documents	73
At least 500 documents	96
At least 100 documents	150
At least 50 documents	181
At least 10 documents	352
At least 5 documents	487
At least 1 document	1,083
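The domain size distribution is cumulative: each row counts the second-level domains that contribute at least the stated number of documents. A minimal sketch of that computation follows, assuming a mapping from domain to document count is available. The function name and the sample data in the usage note are hypothetical.

```python
def domain_size_distribution(doc_counts, thresholds=(1000, 500, 100, 50, 10, 5, 1)):
    """For each threshold, count the domains with at least that many documents."""
    return {
        t: sum(1 for count in doc_counts.values() if count >= t)
        for t in thresholds
    }
```

For instance, given the made-up counts `{"a.net": 1200, "b.org": 60, "c.so": 7, "d.com": 1}`, the function reports one domain with at least 1000 documents, two with at least 50, three with at least 5, and four with at least 1.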

Unlike the other African-language corpora in the HaBiT project, the Somali corpus consists of texts from a large number of web domains. News and politics sites account for a significant share of the corpus sources.

The most frequent words:

Word (Latin script)	Count
oo	2,130,200
ka	1,808,365
ay	1,470,184
ku	1,445,719
iyo	1,248,166
ee	1,210,830
ah	1,062,164
u	1,041,418
in	1,037,431
ayaa	985,020
uu	950,971
soo	794,868
la	720,451
lagu	397,822
ugu	365,182
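Raw counts are easier to compare across corpora once normalised to frequency per million tokens. The sketch below applies that normalisation to the first few words of the list. It assumes the corpus token count reported earlier is the appropriate denominator, and the names are illustrative.

```python
TOKEN_COUNT = 79_741_231  # corpus token count reported in the size table

# First five entries of the frequency list.
TOP_WORDS = {
    "oo": 2_130_200,
    "ka": 1_808_365,
    "ay": 1_470_184,
    "ku": 1_445_719,
    "iyo": 1_248_166,
}

def per_million(count, corpus_size=TOKEN_COUNT):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

for word, count in TOP_WORDS.items():
    print(f"{word}\t{per_million(count):,.0f} per million")
```

On this basis the most frequent word, oo, accounts for a little under 2.7% of all tokens.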

Corpus query interface

The corpus has been indexed by the corpus manager and query system Sketch Engine [5]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.

References

[1] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
[2] -- Scannell, Kevin P. "The Crúbadán Project: Corpus building for under-resourced languages." In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, vol. 4, pp. 5-15. 2007.
[3] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
[4] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." PhD thesis, Masaryk University, Faculty of Informatics (2011).
[5] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.

Last modified on Jan 17, 2017, 12:45:07 PM