= D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated =

== Building the Amharic Web corpus ==

The building of the corpus is described at [[AmharicCorpus]].

== Corpus properties ==

Basic properties of the corpus are summarised below.

The size of corpus structures:
||=Document count =|| 33,542||
||=Paragraph count =|| 341,327||
||=Sentence count =|| 1,208,926||
||=Token count =|| 20,287,250||
||=Ge'ez lexicon size=|| 955,628||
||=Sera lexicon size =|| 948,553||

== Morphological annotation ==

The corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus [1]. The average accuracy of the tagger is 87.4 %.

=== Tag-set ===

||=Basic Class=||=Definition of the tag=||=Tag=||
||Noun||Verbal/infinitival noun, formed from any verb|| VN ||
|| ||Noun attached with a preposition|| NP ||
|| ||Noun attached with a conjunction|| NC ||
|| ||Noun with a proclitic preposition and an enclitic conjunction|| NPC ||
|| ||Any other noun|| N ||
||Pronoun||Pronoun attached with a preposition|| PRONP ||
|| ||Pronoun attached with a conjunction|| PRONC ||
|| ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC ||
|| ||Any other pronoun|| PRON ||
||Verb||Auxiliary verb|| AUX ||
|| ||Relative verb|| VREL ||
|| ||Verb attached with a preposition|| VP ||
|| ||Verb attached with a conjunction|| VC ||
|| ||Verb with a proclitic preposition and an enclitic conjunction|| VPC ||
|| ||Any other verb|| V ||
||Adjective||Adjective attached with a preposition|| ADJP ||
|| ||Adjective attached with a conjunction|| ADJC ||
|| ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC ||
|| ||Any other adjective|| ADJ ||
||Preposition||Preposition|| PREP ||
||Conjunction||Conjunction|| CONJ ||
||Adverb||Adverb|| ADV ||
||Numeral||Cardinal|| NUMCR ||
|| ||Ordinal|| NUMOR ||
|| ||Numeral attached with a preposition|| NUMP ||
|| ||Numeral attached with a conjunction|| NUMC ||
|| ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC ||
||Interjection||Interjection|| INT ||
||Punctuation||Punctuation|| PUNC ||
||Unclassified||Unclassified|| UNC ||

=== Tag frequencies ===

The most frequent parts of speech in both corpora are nouns and verbs.

The most frequent part of speech tags:
||=Part of speech tag =||=Token count =||
||N || 7,386,470||
||NP || 2,660,200||
||VP || 1,601,728||
||V || 1,331,531||
||SENT || 946,905||
||VREL || 920,223||
||PUNC || 741,439||
||PREP || 729,404||
||NUMCR || 687,686||
||ADJ || 647,608||
||PRON || 391,243||
||VN || 389,152||
||AUX || 373,346||
||NC || 322,592||
||CONJ || 292,046||
||PRONP || 204,243||
||ADV || 173,772||
||NPC || 140,109||
||ADJP || 126,138||

== Corpus query interface ==

The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. The corpus can be searched at http://corpora.fi.muni.cz/habit/.

== References ==

 - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue, 19th International Conference, TSD 2016, Brno.
 - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.