= D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated =

== Building the Amharic Web corpus ==

The building of the corpus is described at [[AmharicCorpus]].


== Corpus properties ==
Basic properties of corpus sources are summarised below.

The size of corpus structures:
||=Document count    =||     33,542||
||=Paragraph count   =||    341,327||
||=Sentence count    =||  1,208,926||
||=Token count       =|| 20,287,250||
||=Ge'ez lexicon size=||    955,628||
||=Sera lexicon size =||    948,553||
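
The counts above can be turned into average unit sizes with simple division. The following is a minimal sketch, not part of the original corpus tooling. The names `counts` and `average_sizes` are hypothetical, and the averages treat every token equally, on the assumption that the token count includes punctuation.

```python
# Counts copied from the table above.
counts = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def average_sizes(c):
    """Return average unit sizes derived from the raw counts."""
    return {
        "tokens_per_sentence": c["tokens"] / c["sentences"],
        "sentences_per_paragraph": c["sentences"] / c["paragraphs"],
        "paragraphs_per_document": c["paragraphs"] / c["documents"],
        "tokens_per_document": c["tokens"] / c["documents"],
    }


averages = average_sizes(counts)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

These are plain means over the whole corpus and say nothing about how document or sentence lengths are distributed.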

== Morphological annotation ==

The corpus was tagged with !TreeTagger using a model trained on the cleaned version of the WIC Corpus [1].


The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags are:
||=Part of speech tag =||=Token count =||
||N	|| 7,386,470||
||NP	|| 2,660,200||
||VP	|| 1,601,728||
||V	|| 1,331,531||
||SENT	||   946,905||
||VREL	||   920,223||
||PUNC	||   741,439||
||PREP	||   729,404||
||NUMCR	||   687,686||
||ADJ	||   647,608||
||PRON	||   391,243||
||VN	||   389,152||
||AUX	||   373,346||
||NC	||   322,592||
||CONJ	||   292,046||
||PRONP	||   204,243||
||ADV	||   173,772||
||NPC	||   140,109||
||ADJP	||   126,138||
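
To relate the tag counts to the corpus size, each count can be divided by the total token count. The sketch below assumes that the token count from the corpus properties table (20,287,250) is the right denominator. The names `tag_counts`, `TOTAL_TOKENS`, `shares` and `coverage` are illustrative only.

```python
# Tag counts copied from the table above.
tag_counts = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}

# Assumption: the denominator is the token count reported in the
# corpus properties table.
TOTAL_TOKENS = 20_287_250

# Relative frequency of each listed tag among all tokens.
shares = {tag: n / TOTAL_TOKENS for tag, n in tag_counts.items()}

# Fraction of all tokens accounted for by the listed tags.
coverage = sum(tag_counts.values()) / TOTAL_TOKENS

for tag, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:<6} {share:6.2%}")
print(f"listed tags cover {coverage:.1%} of tokens")
```

Under that assumption, the tokens not covered by the listed tags are those carrying less frequent tags that the table omits.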

== Corpus query interface ==
The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/.

== References ==
 - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
 - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.