D3.1b: A morphologically annotated Amharic corpus of 20 million words
Building the Amharic Web corpus
The building of the corpus is described at AmharicCorpus.
Corpus properties
Basic properties of corpus sources are summarised below.
The size of corpus structures:
Structure | Count |
---|---|
Document count | 33,542 |
Paragraph count | 341,327 |
Sentence count | 1,208,926 |
Token count | 20,287,250 |
Ge'ez lexicon size | 955,628 |
Sera lexicon size | 948,553 |
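The counts above also give a rough picture of the average document, paragraph and sentence length. The following is a minimal sketch of that arithmetic; the dictionary keys and the `ratio` helper are illustrative names, not part of the corpus tooling.

```python
# Counts copied from the table of corpus structures above.
COUNTS = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per one `denominator` unit."""
    return COUNTS[numerator] / COUNTS[denominator]


if __name__ == "__main__":
    print(f"tokens per sentence:     {ratio('tokens', 'sentences'):.1f}")
    print(f"sentences per paragraph: {ratio('sentences', 'paragraphs'):.1f}")
    print(f"paragraphs per document: {ratio('paragraphs', 'documents'):.1f}")
    print(f"tokens per document:     {ratio('tokens', 'documents'):.1f}")
```

These are plain averages over the whole corpus and say nothing about how the lengths are distributed.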
Morphological annotation
The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].
The average accuracy of the tagger is 87.4 %.
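Tagged text of this kind is commonly stored one token per line, with the token and its tag separated by a tab. The sketch below reads such lines and counts the tags. The exact file layout of this corpus is an assumption here, and the sample tokens (`tok1` etc.) are placeholders rather than real corpus data.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_tagged(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from tab-separated tagger output.

    Lines without a tab (blank lines, structure markup such as <s>)
    are skipped. Columns after the tag, e.g. a lemma, are ignored.
    """
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        fields = line.split("\t")
        yield fields[0], fields[1]


def count_tags(lines: Iterable[str]) -> Counter:
    """Count how often each tag occurs in the tagged input."""
    return Counter(tag for _token, tag in read_tagged(lines))


# Placeholder sample; a real run would iterate over an open file instead.
SAMPLE = ["<s>", "tok1\tN", "tok2\tVP", "tok3\tN", "</s>", ""]

if __name__ == "__main__":
    for tag, count in count_tags(SAMPLE).most_common():
        print(tag, count)
```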
Tag-set
Basic Class | Definition of the tag | Tag |
---|---|---|
Noun | Verbal/infinitival noun, formed from any verb | VN |
Noun | Noun attached with a preposition | NP |
Noun | Noun attached with a conjunction | NC |
Noun | Noun with a proclitic preposition and an enclitic conjunction | NPC |
Noun | Any other noun | N |
Pronoun | Pronoun attached with a preposition | PRONP |
Pronoun | Pronoun attached with a conjunction | PRONC |
Pronoun | Pronoun with a proclitic preposition and an enclitic conjunction | PRONPC |
Pronoun | Any other pronoun | PRON |
Verb | Auxiliary verb | AUX |
Verb | Relative verb | VREL |
Verb | Verb attached with a preposition | VP |
Verb | Verb attached with a conjunction | VC |
Verb | Verb with a proclitic preposition and an enclitic conjunction | VPC |
Verb | Any other verb | V |
Adjective | Adjective attached with a preposition | ADJP |
Adjective | Adjective attached with a conjunction | ADJC |
Adjective | Adjective with a proclitic preposition and an enclitic conjunction | ADJPC |
Adjective | Any other adjective | ADJ |
Preposition | Preposition | PREP |
Conjunction | Conjunction | CONJ |
Adverb | Adverb | ADV |
Numeral | Cardinal | NUMCR |
Numeral | Ordinal | NUMOR |
Numeral | Numeral attached with a preposition | NUMP |
Numeral | Numeral attached with a conjunction | NUMC |
Numeral | Numeral with a proclitic preposition and an enclitic conjunction | NUMPC |
Interjection | Interjections | INT |
Punctuation | Punctuation | PUNC |
Unclassified | Unclassified | UNC |
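For processing tagged data it can be handy to map each tag back to its basic class and to the clitics it encodes. The following is a minimal sketch built only from the tag-set table above; `TagInfo`, `TAGSET` and `describe` are illustrative names, not part of any released tool.

```python
from typing import Dict, NamedTuple, Optional


class TagInfo(NamedTuple):
    basic_class: str
    preposition: bool  # word carries an attached preposition
    conjunction: bool  # word carries an attached conjunction


def _family(basic_class: str, stem: str) -> Dict[str, TagInfo]:
    """Tags for the plain form plus its P, C and PC variants."""
    return {
        stem: TagInfo(basic_class, False, False),
        stem + "P": TagInfo(basic_class, True, False),
        stem + "C": TagInfo(basic_class, False, True),
        stem + "PC": TagInfo(basic_class, True, True),
    }


TAGSET: Dict[str, TagInfo] = {}
for _cls, _stem in [("Noun", "N"), ("Pronoun", "PRON"),
                    ("Verb", "V"), ("Adjective", "ADJ")]:
    TAGSET.update(_family(_cls, _stem))

# Numerals have no bare NUM tag; cardinals and ordinals are separate.
TAGSET.update({
    "NUMCR": TagInfo("Numeral", False, False),
    "NUMOR": TagInfo("Numeral", False, False),
    "NUMP": TagInfo("Numeral", True, False),
    "NUMC": TagInfo("Numeral", False, True),
    "NUMPC": TagInfo("Numeral", True, True),
})

# Remaining tags from the table, none of which encode clitics.
for _tag, _cls in [("VN", "Noun"), ("AUX", "Verb"), ("VREL", "Verb"),
                   ("PREP", "Preposition"), ("CONJ", "Conjunction"),
                   ("ADV", "Adverb"), ("INT", "Interjection"),
                   ("PUNC", "Punctuation"), ("UNC", "Unclassified")]:
    TAGSET[_tag] = TagInfo(_cls, False, False)


def describe(tag: str) -> Optional[TagInfo]:
    """Look up a tag; returns None for tags missing from the table."""
    return TAGSET.get(tag)
```

An explicit table is used instead of stripping a trailing `P` or `C`, because tags such as `PREP`, `PUNC` and `UNC` end in those letters without encoding a clitic.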
Tag frequencies
The most frequent parts of speech in the corpus are nouns and verbs. The most frequent part-of-speech tags are:
Part of speech tag | Token count |
---|---|
N | 7,386,470 |
NP | 2,660,200 |
VP | 1,601,728 |
V | 1,331,531 |
SENT | 946,905 |
VREL | 920,223 |
PUNC | 741,439 |
PREP | 729,404 |
NUMCR | 687,686 |
ADJ | 647,608 |
PRON | 391,243 |
VN | 389,152 |
AUX | 373,346 |
NC | 322,592 |
CONJ | 292,046 |
PRONP | 204,243 |
ADV | 173,772 |
NPC | 140,109 |
ADJP | 126,138 |
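To relate these counts to the corpus size, the listed tags can be expressed as shares of the total token count (20,287,250). The sketch below does this for the noun and verb tags in the table; the grouping into `NOUN_TAGS` and `VERB_TAGS` follows the basic classes of the tag-set, and only tags that appear in the frequency table are included, so the shares are lower bounds.

```python
TOTAL_TOKENS = 20_287_250  # token count of the whole corpus

# Counts copied from the tag frequency table above.
TAG_COUNTS = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}

# Tags of the Noun and Verb basic classes that occur in the table.
NOUN_TAGS = ("N", "NP", "NC", "NPC", "VN")
VERB_TAGS = ("V", "VP", "VREL", "AUX")


def share(tags) -> float:
    """Fraction of all corpus tokens carrying one of the given tags."""
    return sum(TAG_COUNTS[tag] for tag in tags) / TOTAL_TOKENS


if __name__ == "__main__":
    print(f"noun tags: {share(NOUN_TAGS):.1%}")
    print(f"verb tags: {share(VERB_TAGS):.1%}")
    print(f"plain N:   {share(['N']):.1%}")
```

With these figures the noun tags cover a little over half of all tokens and the verb tags about a fifth.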
Corpus query interface
The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. It can be searched at http://corpora.fi.muni.cz/habit/.
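Sketch Engine queries are written in CQL, where each token is described by attribute conditions in square brackets. The helper below is a small sketch that assembles such query strings; it assumes the part-of-speech attribute of this corpus is called `tag`, which should be checked in the corpus details before use. `cql_token` and `cql_sequence` are illustrative names.

```python
def cql_token(**attrs: str) -> str:
    """Build one CQL token expression, e.g. [tag="NP"].

    Attribute values are regular expressions in CQL; double quotes
    inside a value are escaped with a backslash.
    """
    parts = []
    for name, value in attrs.items():
        escaped = value.replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


if __name__ == "__main__":
    # An adjective directly followed by any noun tag starting with N.
    print(cql_sequence(cql_token(tag="ADJ"), cql_token(tag="N.*")))
```

Note that a pattern such as `N.*` also matches `NUMCR` and the other numeral tags, so a narrower expression like `N|NP|NC|NPC` may be preferable for nouns.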
References
- [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
- [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
Last modified on Jan 19, 2017, 1:55:45 AM