D3.2b: An Afaan Oromo corpus, sized 3 million words, morphologically annotated
Building the Oromo Web corpus
The building of the corpus is described at OromoCorpus.
Corpus properties
Basic properties of corpus sources are summarised below.
The size of corpus structures:
Structure | Count |
---|---|
Document count | 8,851 |
Paragraph count | 76,115 |
Sentence count | 250,432 |
Token count | 5,091,696 |
Latin script lexicon size | 273,056 |
Morphological annotation
The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
Tag-set
The tag-set is based on the POS tags of Universal Dependencies [2].
Tag | Description |
---|---|
ADJ | adjective |
ADV | adverb |
AUX | auxiliary |
CONJ | conjunction |
DET | determiner |
NOUN | noun |
NUM | numeral |
PREP | preposition |
PRON | pronoun |
PUNCT | punctuation |
SYM | symbol |
VERB | verb |
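To illustrate the general idea of a tagger that combines a lexicon of frequent words with regular expressions, here is a minimal sketch. It is not the tagger used for this corpus: the lexicon entries, their assumed tags, the patterns, and the names `LEXICON`, `RULES`, `DEFAULT_TAG`, `tag_token` and `tag_sentence` are all hypothetical, and falling back to NOUN for unknown words is an assumption made only for this example.

```python
import re

# Hypothetical lexicon of frequent words with assumed tags.
# These entries are illustrative placeholders, not the project's lexicon.
LEXICON = {
    "fi": "CONJ",
    "kana": "DET",
    "inni": "PRON",
}

# Hypothetical regular-expression rules, tried in order.
RULES = [
    (re.compile(r"\d+([.,]\d+)*"), "NUM"),        # 2016, 3.5, 1,000
    (re.compile(r'[.,;:!?()\[\]"-]+'), "PUNCT"),  # common punctuation marks
    (re.compile(r"[^\w\s]+"), "SYM"),             # any other non-word symbol
]

# Assumption for this sketch: unknown tokens get a single default tag.
DEFAULT_TAG = "NOUN"


def tag_token(token):
    """Return a tag from the tag-set above for one token."""
    # 1. Lexicon lookup, case-insensitive.
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    # 2. The first regular expression matching the whole token wins.
    for pattern, rule_tag in RULES:
        if pattern.fullmatch(token):
            return rule_tag
    # 3. Fallback.
    return DEFAULT_TAG


def tag_sentence(tokens):
    """Tag an already tokenised sentence; returns (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]
```

In this sketch the lexicon is consulted before the patterns, so a frequent word always receives its listed tag; a real tagger would likely need many more rules, for example for suffixes.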
Tag frequencies
Nouns are by far the most frequent part of speech in the corpus, accounting for about 56% of all tokens. The token count for each part-of-speech tag:
Part of speech tag | Token count |
---|---|
NOUN | 2,873,618 |
PUNCT | 637,157 |
PREP | 296,625 |
PRON | 281,346 |
ADJ | 180,454 |
NUM | 166,468 |
VERB | 135,831 |
CONJ | 130,666 |
SYM | 119,514 |
ADV | 111,580 |
AUX | 110,688 |
DET | 47,749 |
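The counts in the table can be turned into relative frequencies with a few lines of code. The sketch below copies the figures from this page; the names `TAG_COUNTS`, `TOKEN_COUNT` and `shares` are introduced only for illustration. It also checks that the per-tag counts add up to the token count reported under corpus properties.

```python
# Token counts per tag, copied from the table above.
TAG_COUNTS = {
    "NOUN": 2_873_618,
    "PUNCT": 637_157,
    "PREP": 296_625,
    "PRON": 281_346,
    "ADJ": 180_454,
    "NUM": 166_468,
    "VERB": 135_831,
    "CONJ": 130_666,
    "SYM": 119_514,
    "ADV": 111_580,
    "AUX": 110_688,
    "DET": 47_749,
}

# Token count from the corpus properties table.
TOKEN_COUNT = 5_091_696

total = sum(TAG_COUNTS.values())

# Share of each tag as a percentage of all tagged tokens.
shares = {tag: 100 * count / total for tag, count in TAG_COUNTS.items()}

if __name__ == "__main__":
    print("Counts match corpus size:", total == TOKEN_COUNT)
    for tag, count in sorted(TAG_COUNTS.items(), key=lambda item: -item[1]):
        print(f"{tag:<6}{count:>10,}{shares[tag]:>7.1f}%")
```

The per-tag counts sum to exactly 5,091,696, so every token in the corpus carries one of the twelve tags; NOUN alone covers roughly 56.4% of them.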
Corpus query interface
The corpus has been indexed by the corpus manager and query system Sketch Engine [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16.
References
- [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
- [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.