D3.1a: An Amharic corpus, sized 20 million words
Building the Amharic Web corpus
We used the following steps to create a large Web corpus. First, following the Corpus Factory method [1], bigrams of Amharic words from the Crúbadán database [2] were used to query the Bing search engine for documents in Amharic. The 354 queries yielded 6,453 URLs. The URLs of the 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
The following language models and resources were created:
- Character trigram model for language detection. 5.2 MB of text from the WIC Corpus and Amharic Wikipedia was used to train the model.
- Byte trigram model for character encoding detection. The model was trained using web pages obtained by the Corpus factory method.
- A list of the most frequent Amharic words from the WIC Corpus wordlist, used as a resource for the boilerplate removal tool jusText [4].
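The idea behind a character trigram model for language detection can be illustrated with a minimal sketch. This is not the model used to build the corpus: the function names, the floor probability for unseen trigrams, and the toy training strings (assembled from frequent words, not meaningful sentences) are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train_profile(text):
    """Return relative trigram frequencies for a training text."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def score(text, profile, floor=1e-6):
    """Average log-probability of the text's trigrams under a profile.

    Unseen trigrams receive a small floor probability (an arbitrary
    choice for this sketch, not a tuned smoothing method).
    """
    grams = char_trigrams(text)
    if not grams:
        return float("-inf")
    return sum(math.log(profile.get(g, floor)) for g in grams) / len(grams)


def detect(text, profiles):
    """Return the name of the best-scoring language profile."""
    return max(profiles, key=lambda name: score(text, profiles[name]))


# Toy training data; a real model would be trained on megabytes of text.
profiles = {
    "amharic": train_profile("ይህ ነገር ነው ። ወደ ቤት ነበር ። ግን ጊዜ ብቻ ነው ።"),
    "english": train_profile("this is a thing. it was at home. but there is only time."),
}

print(detect("ወደ ቤት ነው", profiles))
print(detect("it is home", profiles))
```

A byte trigram model for encoding detection could follow the same pattern, with trigrams taken over the raw bytes of a page instead of its decoded characters.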
The crawler was set to harvest web domains in the Ethiopian national top-level domain et and in the general TLDs com, org, info, net and edu. In the process, 3.6 GB of HTTP responses were gathered. HTML tags and boilerplate paragraphs were removed from the raw data. 42 % of paragraphs were identified as duplicates or near duplicates and removed using the tool onion [4]. 66 MB of deduplicated text obtained by the same process in 2013 was added to the data. Sentence boundaries were marked at positions with the Amharic end-of-sentence characters "።" and "፧". The final size of the corpus (containing data from the years 2013, 2015 and 2016) is 461 MB, or 20.3 million tokens. Finally, the corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC16 (Amharic "Web as Corpus" corpus, year 2016).
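The sentence boundary step can be sketched as follows. This is a minimal illustration, assuming that a sentence ends exactly at "።" (Ethiopic full stop) or "፧" (Ethiopic question mark) and that no other punctuation is considered; the function name is hypothetical and the actual pipeline may differ.

```python
import re

# One sentence: a run of non-terminator characters, optionally followed
# by one of the two end-of-sentence characters named in the text.
SENTENCE = re.compile(r"[^።፧]+[።፧]?")


def split_sentences(paragraph):
    """Split a paragraph into sentences at Amharic end-of-sentence marks."""
    pieces = (match.strip() for match in SENTENCE.findall(paragraph))
    return [piece for piece in pieces if piece]


for sentence in split_sentences("ይህ ቤት ነው። ጊዜ ነበር፧ ግን"):
    print(sentence)
```

Text after the last terminator is kept as a final, unterminated sentence rather than dropped, which seems the safer choice for web paragraphs that are often cut off.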
Corpus properties
Basic properties of corpus sources are summarised below.
The size of corpus structures:
Structure | Count |
---|---|
Document count | 33,542 |
Paragraph count | 341,327 |
Sentence count | 1,208,926 |
Token count | 20,287,250 |
Ge'ez script lexicon size | 955,628 |
Sera transliteration lexicon size | 948,553 |
Document count – the most frequent web domains and domain size distribution:
Top-level domain | Documents | Web domain | Documents | Domain size | Number of domains |
---|---|---|---|---|---|
org | 14,582 | *.jw.org | 6,717 | At least 1000 documents | 7 |
com | 11,927 | *.gov.et | 4,599 | At least 500 documents | 15 |
et | 5,090 | waltainfo.com | 2,818 | At least 100 documents | 42 |
net | 1,084 | ginbot7.org | 2,666 | At least 50 documents | 63 |
cz | 724 | eotcmk.org | 1,141 | At least 10 documents | 149 |
info | 85 | ethsat.com | 894 | At least 1 document | 573 |
We observe that content from news/political and religious sites has a significant presence in the corpus sources. Since only 149 domains are represented in the corpus by at least 10 documents, the resulting collection would benefit from a greater variety of sources.
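A domain size distribution of this kind can be derived from the list of document URLs. The sketch below is an assumption about how such counts might be computed, not the procedure used for the table above: it groups documents by full hostname, whereas the table groups some sites by wildcard (for example *.gov.et), and the function name and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 1)):
    """Count how many hostnames have at least N documents, for each N."""
    hostnames = (urlsplit(url).hostname for url in urls)
    # Skip URLs without a parseable hostname.
    docs_per_domain = Counter(host for host in hostnames if host)
    return {
        threshold: sum(1 for n in docs_per_domain.values() if n >= threshold)
        for threshold in thresholds
    }


# Hypothetical URLs for illustration only.
urls = [
    "http://news.example.org/a",
    "http://news.example.org/b",
    "http://blog.example.com/1",
]
print(domain_size_distribution(urls, thresholds=(2, 1)))
```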
The most frequent words:
Word (Ge'ez script) | Word (Sera transliteration) | Count |
---|---|---|
ነው | new | 155,520 |
ላይ | lay | 91,592 |
እና | Ina | 49,733 |
ውስጥ | wsT | 42,429 |
ግን | gn | 39,537 |
ወደ | wede | 39,162 |
ጋር | gar | 36,057 |
ነበር | neber | 34,055 |
ነገር | neger | 30,670 |
ጊዜ | gizE | 27,413 |
ደግሞ | degmo | 26,890 |
ይህ | yh | 25,622 |
አንድ | and | 25,546 |
ብቻ | bca | 23,468 |
ቤት | bEt | 22,466 |
Corpus query interface
The corpus has been indexed by the corpus manager and query system Sketch Engine [5]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16.
References
- [1] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
- [2] -- Scannell, Kevin P. "The Crúbadán Project: Corpus building for under-resourced languages." In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop, vol. 4, pp. 5-15. 2007.
- [3] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
- [4] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Doctoral thesis, Masaryk University, Faculty of Informatics (2011).
- [5] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.