| 2 | |
| 3 | == Building the Amharic Web corpus == |
| 4 | |
| 5 | The building of the corpus is described at [[AmharicCorpus]]. |
| 6 | |
| 7 | |
| 8 | == Corpus properties == |
| 9 | Basic properties of corpus sources are summarised below. |
| 10 | |
| 11 | The size of corpus structures: |
| 12 | ||=Document count =|| 33,542|| |
| 13 | ||=Paragraph count =|| 341,327|| |
| 14 | ||=Sentence count =|| 1,208,926|| |
| 15 | ||=Token count =|| 20,287,250|| |
| 16 | ||=Ge'ez lexicon size=|| 955,628|| |
| 17 | ||=Sera lexicon size =|| 948,553|| |
| 18 | |
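The counts above can be turned into average structure sizes, which are often easier to compare across corpora. The following is only a minimal sketch: the variable names are illustrative, and the figures are copied from the table, not recomputed from the corpus itself.

```python
# Counts copied from the "size of corpus structures" table.
documents = 33_542
paragraphs = 341_327
sentences = 1_208_926
tokens = 20_287_250

# Simple ratios between adjacent structure levels.
averages = {
    "paragraphs per document": paragraphs / documents,
    "sentences per paragraph": sentences / paragraphs,
    "tokens per sentence": tokens / sentences,
    "tokens per document": tokens / documents,
}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

These are plain arithmetic means; they say nothing about how document or sentence lengths are distributed.
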
| 19 | == Morphological annotation == |
| 20 | |
| 21 | The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1]. |
| 22 | |
| 23 | |
| 24 | The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags are: |
| 25 | ||=Part of speech tag =||=Token count =|| |
| 26 | ||N || 7,386,470|| |
| 27 | ||NP || 2,660,200|| |
| 28 | ||VP || 1,601,728|| |
| 29 | ||V || 1,331,531|| |
| 30 | ||SENT || 946,905|| |
| 31 | ||VREL || 920,223|| |
| 32 | ||PUNC || 741,439|| |
| 33 | ||PREP || 729,404|| |
| 34 | ||NUMCR || 687,686|| |
| 35 | ||ADJ || 647,608|| |
| 36 | ||PRON || 391,243|| |
| 37 | ||VN || 389,152|| |
| 38 | ||AUX || 373,346|| |
| 39 | ||NC || 322,592|| |
| 40 | ||CONJ || 292,046|| |
| 41 | ||PRONP || 204,243|| |
| 42 | ||ADV || 173,772|| |
| 43 | ||NPC || 140,109|| |
| 44 | ||ADJP || 126,138|| |
| 45 | |
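To put the tag counts in proportion, each can be divided by the corpus token count (20,287,250). The following is only a rough sketch: it uses the five most frequent tags from the table, and the names `tag_counts` and `shares` are illustrative.

```python
# Token count of the whole corpus, from the corpus properties table.
total_tokens = 20_287_250

# The five most frequent tags, copied from the table above.
tag_counts = {
    "N": 7_386_470,
    "NP": 2_660_200,
    "VP": 1_601_728,
    "V": 1_331_531,
    "SENT": 946_905,
}

# Share of all corpus tokens carried by each tag.
shares = {tag: count / total_tokens for tag, count in tag_counts.items()}

for tag, share in shares.items():
    print(f"{tag:<5} {share:6.2%}")
```

The shares are relative to all tokens in the corpus, not only to the tags listed in the table.
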
| 46 | == Corpus query interface == |
| 47 | The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/. |
| 48 | |
| 49 | == References == |
| 50 | - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno. |
| 51 | - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36. |