D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated

Building the Amharic Web corpus

The building of the corpus is described at AmharicCorpus.

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 33,542
Paragraph count 341,327
Sentence count 1,208,926
Token count 20,287,250
Ge'ez lexicon size 955,628
Sera lexicon size 948,553
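
The counts above can be combined into a few derived averages, such as the mean sentence length. The following is a minimal sketch that uses only the figures from the table; the names `counts` and `ratio` are illustrative and not part of any corpus tool.

```python
# Structure counts copied from the table above.
counts = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


# Print each derived average rounded to one decimal place.
for num, den in [
    ("tokens", "sentences"),
    ("sentences", "paragraphs"),
    ("paragraphs", "documents"),
    ("tokens", "documents"),
]:
    print(f"{num} per {den[:-1]}: {ratio(num, den):.1f}")
```

These are plain averages over the whole corpus and say nothing about how the lengths are distributed across documents.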

Morphological annotation

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].

The average accuracy of the tagger is 87.4 %.
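
This page does not state how the average was obtained. As an illustration only, the sketch below assumes that accuracy is the share of tokens whose predicted tag equals the gold tag, and that the average is the mean over several evaluation folds. The function names and the toy data are hypothetical and do not come from the actual evaluation.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Share of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def average_accuracy(folds):
    """Mean of the per-fold accuracies (assumed averaging scheme)."""
    scores = [token_accuracy(gold, pred) for gold, pred in folds]
    return sum(scores) / len(scores)


# Toy data: two folds of four tokens each, as (gold tags, predicted tags).
toy_folds = [
    (["N", "VP", "PUNC", "ADJ"], ["N", "V", "PUNC", "ADJ"]),    # 3 of 4 correct
    (["NP", "N", "AUX", "PUNC"], ["NP", "N", "AUX", "PUNC"]),  # 4 of 4 correct
]

print(f"average accuracy: {average_accuracy(toy_folds):.1%}")
```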


Basic class | Definition of the tag | Tag
Noun | Verbal/infinitival noun, formed from any verb | VN
Noun | Noun attached with a preposition | NP
Noun | Noun attached with conjunction | NC
Noun | Noun with a proclitic preposition and an enclitic conjunction | NPC
Noun | Any other noun | N
Pronoun | Pronoun attached with preposition | PRONP
Pronoun | Pronoun attached with conjunction | PRONC
Pronoun | Pronoun with a proclitic preposition and an enclitic conjunction | PRONPC
Pronoun | Any other pronoun | PRON
Verb | Auxiliary verb | AUX
Verb | Relative verb | VREL
Verb | Verb attached with preposition | VP
Verb | Verb attached with conjunction | VC
Verb | Verb with a proclitic preposition and an enclitic conjunction | VPC
Verb | Verb (all other) | V
Adjective | Adjective attached with preposition | ADJP
Adjective | Adjective attached with conjunction | ADJC
Adjective | Adjective with a proclitic preposition and an enclitic conjunction | ADJPC
Adjective | Any other adjective | ADJ
Preposition | Preposition | PREP
Conjunction | Conjunction | CONJ
Adverb | Adverb | ADV
Numeral | Cardinal | NUMCR
Numeral | Ordinal | NUMOR
Numeral | Numeral attached with preposition | NUMP
Numeral | Numeral attached with conjunction | NUMC
Numeral | Numeral with a proclitic preposition and an enclitic conjunction | NUMPC
Interjection | Interjection | INT
Punctuation | Punctuation | PUNC
Unclassified | Unclassified | UNC
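
The tagset is regular: for five basic classes, the suffixes P, C and PC mark an attached preposition, an attached conjunction, or both. The sketch below builds a lookup table from that pattern so that a tag can be mapped back to its basic class and clitics. The names `TagInfo`, `TAGSET` and `describe` are illustrative only and are not part of TreeTagger or Sketch Engine.

```python
from typing import NamedTuple, Optional


class TagInfo(NamedTuple):
    basic_class: str
    preposition: bool  # a preposition is attached
    conjunction: bool  # a conjunction is attached


# Tags without attached clitics, taken from the table above.
PLAIN_TAGS = {
    "VN": "Noun", "N": "Noun", "PRON": "Pronoun",
    "AUX": "Verb", "VREL": "Verb", "V": "Verb",
    "ADJ": "Adjective", "PREP": "Preposition", "CONJ": "Conjunction",
    "ADV": "Adverb", "NUMCR": "Numeral", "NUMOR": "Numeral",
    "INT": "Interjection", "PUNC": "Punctuation", "UNC": "Unclassified",
}

# Stems that combine with the clitic suffixes P, C and PC.
CLITIC_STEMS = {"N": "Noun", "PRON": "Pronoun", "V": "Verb",
                "ADJ": "Adjective", "NUM": "Numeral"}
CLITIC_SUFFIXES = {"P": (True, False), "C": (False, True), "PC": (True, True)}

TAGSET = {tag: TagInfo(cls, False, False) for tag, cls in PLAIN_TAGS.items()}
for stem, cls in CLITIC_STEMS.items():
    for suffix, (prep, conj) in CLITIC_SUFFIXES.items():
        TAGSET[stem + suffix] = TagInfo(cls, prep, conj)


def describe(tag: str) -> Optional[TagInfo]:
    """Return the tag's description, or None if it is not in the table."""
    return TAGSET.get(tag)


print(describe("NPC"))
print(describe("VREL"))
```

An explicit table is used rather than stripping suffixes from arbitrary strings, because tags such as PREP, PUNC and UNC end in P or C without carrying a clitic. Tags that are not listed in the table return None.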

Tag frequencies

The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags are listed below.

Part of speech tag Token count
N 7,386,470
NP 2,660,200
VP 1,601,728
V 1,331,531
SENT 946,905
VREL 920,223
PUNC 741,439
PREP 729,404
NUMCR 687,686
ADJ 647,608
PRON 391,243
VN 389,152
AUX 373,346
NC 322,592
CONJ 292,046
PRONP 204,243
ADV 173,772
NPC 140,109
ADJP 126,138
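
To relate these counts to the corpus size, each can be divided by the total token count given under Corpus properties (20,287,250). The sketch below does this for the noun and verb tags in the list. Because only the most frequent tags are listed, the resulting shares are lower bounds for each class. The grouping into `NOUN_TAGS` and `VERB_TAGS` follows the tagset table; the variable and function names are illustrative.

```python
TOTAL_TOKENS = 20_287_250  # token count from "Corpus properties"

# Counts copied from the frequency list above (noun and verb tags only).
TAG_COUNTS = {
    "N": 7_386_470, "NP": 2_660_200, "VN": 389_152,
    "NC": 322_592, "NPC": 140_109,
    "VP": 1_601_728, "V": 1_331_531, "VREL": 920_223, "AUX": 373_346,
}

NOUN_TAGS = ("N", "NP", "VN", "NC", "NPC")
VERB_TAGS = ("VP", "V", "VREL", "AUX")


def share(tags) -> float:
    """Fraction of all corpus tokens carrying one of the given tags."""
    return sum(TAG_COUNTS[tag] for tag in tags) / TOTAL_TOKENS


print(f"listed noun tags: {share(NOUN_TAGS):.1%}")
print(f"listed verb tags: {share(VERB_TAGS):.1%}")
```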

Corpus query interface

The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. The corpus can be searched at


  • [1] Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
  • [2] Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
