D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated
Building the Amharic Web corpus
The building of the corpus is described at AmharicCorpus.
Corpus properties
Basic properties of corpus sources are summarised below.
The size of corpus structures:
Structure | Count |
---|---|
Document count | 33,542 |
Paragraph count | 341,327 |
Sentence count | 1,208,926 |
Token count | 20,287,250 |
Ge'ez lexicon size | 955,628 |
Sera lexicon size | 948,553 |
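The counts above imply some average sizes that are not stated on this page. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name `average_sizes` are illustrative assumptions, and only the four numbers come from the table.

```python
# Structure counts copied from the table above.
COUNTS = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def average_sizes(counts: dict[str, int]) -> dict[str, float]:
    """Derive average sizes by dividing each count by the next larger unit."""
    return {
        "paragraphs_per_document": counts["paragraphs"] / counts["documents"],
        "sentences_per_paragraph": counts["sentences"] / counts["paragraphs"],
        "tokens_per_sentence": counts["tokens"] / counts["sentences"],
        "tokens_per_document": counts["tokens"] / counts["documents"],
    }


if __name__ == "__main__":
    for name, value in average_sizes(COUNTS).items():
        print(f"{name}: {value:.2f}")
```

These are plain means over the whole corpus; they say nothing about how document or sentence lengths are distributed.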
Morphological annotation
The corpus was tagged by TreeTagger using a model trained on the cleaned version of the WIC Corpus [1].
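Tag frequencies for a tagged corpus can be counted from tagger output with a few lines of code. The sketch below assumes a tab-separated, one-token-per-line layout (token in the first column, tag in the second) with structural markup on lines of its own. This page does not specify the actual file layout of this corpus, so the layout, the function name `count_tags`, and the placeholder sample tokens are all assumptions.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count part-of-speech tags in tab-separated 'token<TAB>tag' lines.

    Lines without a tab (assumed to be structural markup such as <s> or
    blank lines) are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a token line under the assumed layout
        counts[fields[1]] += 1
    return counts


# Placeholder tokens (w1, w2, ...) stand in for real corpus words.
SAMPLE = ["<s>", "w1\tN", "w2\tV", "w3\tN", "w4\tSENT", "</s>"]

if __name__ == "__main__":
    for tag, n in count_tags(SAMPLE).most_common():
        print(tag, n)
```

For a real file, passing an open file handle (opened with `encoding="utf-8"`) to `count_tags` would stream the corpus line by line rather than loading it into memory.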
The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags, with their token counts:
Part of speech tag | Token count |
---|---|
N | 7,386,470 |
NP | 2,660,200 |
VP | 1,601,728 |
V | 1,331,531 |
SENT | 946,905 |
VREL | 920,223 |
PUNC | 741,439 |
PREP | 729,404 |
NUMCR | 687,686 |
ADJ | 647,608 |
PRON | 391,243 |
VN | 389,152 |
AUX | 373,346 |
NC | 322,592 |
CONJ | 292,046 |
PRONP | 204,243 |
ADV | 173,772 |
NPC | 140,109 |
ADJP | 126,138 |
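The absolute counts above are easier to compare as shares of the whole corpus. The following sketch divides each tag count by the token count of 20,287,250 reported under Corpus properties. The names `TAG_COUNTS`, `TOTAL_TOKENS` and `tag_shares` are illustrative, and it is an assumption that both tables count the same tokens.

```python
# Token count from the corpus properties table.
TOTAL_TOKENS = 20_287_250

# Tag counts copied from the table above.
TAG_COUNTS = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}


def tag_shares(tag_counts: dict[str, int], total: int) -> dict[str, float]:
    """Return each tag's share of the total token count as a fraction."""
    return {tag: count / total for tag, count in tag_counts.items()}


if __name__ == "__main__":
    shares = tag_shares(TAG_COUNTS, TOTAL_TOKENS)
    for tag, share in shares.items():
        print(f"{tag:6s} {share:6.2%}")
    listed = sum(TAG_COUNTS.values())
    print(f"listed tags cover {listed / TOTAL_TOKENS:.2%} of all tokens")
```

Under that assumption, the listed tags account for roughly 99% of all tokens, so the remaining tags are rare.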
Corpus query interface
The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/.
References
- [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
- [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.