== Building the Amharic Web corpus ==

The building of the corpus is described at [[AmharicCorpus]].

== Corpus properties ==
Basic properties of the corpus sources are summarised below.

The size of corpus structures:
||=Document count =|| 33,542||
||=Paragraph count =|| 341,327||
||=Sentence count =|| 1,208,926||
||=Token count =|| 20,287,250||
||=Ge'ez lexicon size=|| 955,628||
||=Sera lexicon size =|| 948,553||
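
The structure counts above also give a rough idea of average unit sizes. The following is a minimal sketch that derives a few ratios from the figures in the table; the dictionary keys and the `ratio` helper are illustrative names, not part of any corpus tooling.

```python
# Counts copied from the "size of corpus structures" table.
counts = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


averages = {
    "tokens per sentence": ratio("tokens", "sentences"),
    "sentences per paragraph": ratio("sentences", "paragraphs"),
    "paragraphs per document": ratio("paragraphs", "documents"),
    "tokens per document": ratio("tokens", "documents"),
}

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

These are plain averages over the whole corpus and say nothing about how the sizes are distributed across individual documents.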

== Morphological annotation ==

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].

The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags:
||=Part of speech tag =||=Token count =||
||N || 7,386,470||
||NP || 2,660,200||
||VP || 1,601,728||
||V || 1,331,531||
||SENT || 946,905||
||VREL || 920,223||
||PUNC || 741,439||
||PREP || 729,404||
||NUMCR || 687,686||
||ADJ || 647,608||
||PRON || 391,243||
||VN || 389,152||
||AUX || 373,346||
||NC || 322,592||
||CONJ || 292,046||
||PRONP || 204,243||
||ADV || 173,772||
||NPC || 140,109||
||ADJP || 126,138||
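
To put the tag counts in proportion, they can be expressed as shares of the whole corpus. The sketch below assumes that the tag counts are taken over the same 20,287,250 tokens reported under Corpus properties; the variable names are illustrative only.

```python
# Total token count from the corpus properties table (assumed to be the
# denominator for the tag counts below).
TOTAL_TOKENS = 20_287_250

# Tag counts copied from the table of most frequent part-of-speech tags.
tag_counts = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}

# Share of all corpus tokens carried by each listed tag.
shares = {tag: count / TOTAL_TOKENS for tag, count in tag_counts.items()}

# Fraction of the corpus covered by the listed tags taken together.
covered = sum(tag_counts.values()) / TOTAL_TOKENS

for tag, share in shares.items():
    print(f"{tag:<6} {share:6.1%}")
print(f"listed tags together: {covered:.1%}")
```

Because the table lists only the most frequent tags, the shares do not sum to 100%; the remainder belongs to tags not shown here.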

== Corpus query interface ==
The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. The corpus can be searched at http://corpora.fi.muni.cz/habit/.

== References ==
 - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue, 19th International Conference, TSD 2016, Brno.
 - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.