
Version 3 (modified by hales, 7 years ago)

D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated

Building the Amharic Web corpus

The building of the corpus is described at AmharicCorpus.

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 33,542
Paragraph count 341,327
Sentence count 1,208,926
Token count 20,287,250
Ge'ez lexicon size 955,628
Sera lexicon size 948,553
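
As a rough illustration of what these figures imply, the following Python sketch derives average sizes from the counts above. The variable and function names are illustrative only and are not part of the corpus tooling.

```python
# Counts copied from the size table above.
counts = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


tokens_per_sentence = ratio("tokens", "sentences")
tokens_per_document = ratio("tokens", "documents")
sentences_per_paragraph = ratio("sentences", "paragraphs")

print(f"tokens per sentence:     {tokens_per_sentence:.1f}")
print(f"tokens per document:     {tokens_per_document:.1f}")
print(f"sentences per paragraph: {sentences_per_paragraph:.1f}")
```

On these figures a sentence averages roughly 17 tokens and a document roughly 600 tokens.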

Morphological annotation

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].
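
Tagger output of this kind is commonly stored one token per line with tab-separated columns. The sketch below counts tags in such a file. It assumes a token, tag, lemma column order (the actual columns depend on how TreeTagger was invoked), and the sample lines are hypothetical placeholders, not data from the corpus.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count POS tags in tab-separated tagger output (token, tag[, lemma])."""
    tags: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and structural markup such as <s> or <doc ...>.
        # Assumption: no real token line starts with "<".
        if not line or line.startswith("<"):
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            continue  # malformed line without a tag column
        tags[columns[1]] += 1
    return tags


# Hypothetical sample; tokens and lemmas are placeholders.
sample = [
    "<s>",
    "tok1\tN\tlem1",
    "tok2\tV\tlem2",
    "tok3\tN\tlem3",
    ".\tSENT\t.",
    "</s>",
]
tag_counts = count_tags(sample)
print(tag_counts.most_common())
```

Passing an open file object instead of `sample` would process a full tagged file line by line without loading it into memory.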

The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags are:

Part of speech tag Token count
N 7,386,470
NP 2,660,200
VP 1,601,728
V 1,331,531
SENT 946,905
VREL 920,223
PUNC 741,439
PREP 729,404
NUMCR 687,686
ADJ 647,608
PRON 391,243
VN 389,152
AUX 373,346
NC 322,592
CONJ 292,046
PRONP 204,243
ADV 173,772
NPC 140,109
ADJP 126,138
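
To put the tag counts in proportion, the following sketch computes each tag's share of the corpus. It assumes the counts above refer to the same 20,287,250-token corpus reported in the size table; the variable names are illustrative.

```python
# Total token count, taken from the size table (assumed to be the denominator).
TOTAL_TOKENS = 20_287_250

# Tag counts copied from the table above.
tag_counts = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}

# Share of all tokens carried by each tag.
shares = {tag: count / TOTAL_TOKENS for tag, count in tag_counts.items()}

# Fraction of the corpus covered by the tags listed in the table.
coverage = sum(tag_counts.values()) / TOTAL_TOKENS

top_five = sorted(shares.items(), key=lambda item: item[1], reverse=True)[:5]
for tag, share in top_five:
    print(f"{tag:6s} {share:6.1%}")
print(f"listed tags cover {coverage:.1%} of all tokens")
```

Under that assumption, the tag N alone accounts for a little over a third of all tokens, and the tags listed cover roughly 99% of the corpus.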

Corpus query interface

The corpus has been indexed by the corpus manager and query system Sketch Engine [2] and can be searched at http://corpora.fi.muni.cz/habit/.

References

  • [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
  • [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.