
D3.2b: An Afaan Oromo corpus, sized 3 million words, morphologically annotated

Building the Oromo Web corpus

The building of the corpus is described at OromoCorpus.

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 8,851
Paragraph count 76,115
Sentence count 250,432
Token count 5,091,696
Latin script lexicon size 273,056
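As a rough illustration of what the sizes above imply, the following sketch derives average structure sizes from the reported counts. The dictionary keys and the helper name `average` are chosen here for illustration and are not part of the corpus tooling.

```python
# Structure counts copied from the table above.
CORPUS_SIZES = {
    "documents": 8_851,
    "paragraphs": 76_115,
    "sentences": 250_432,
    "tokens": 5_091_696,
}


def average(inner: str, outer: str, sizes: dict = CORPUS_SIZES) -> float:
    """Return the mean number of `inner` units per `outer` unit."""
    return sizes[inner] / sizes[outer]


if __name__ == "__main__":
    # Print each ratio rounded to one decimal place.
    for inner, outer in [
        ("tokens", "sentences"),
        ("sentences", "paragraphs"),
        ("paragraphs", "documents"),
        ("tokens", "documents"),
    ]:
        print(f"{inner} per {outer[:-1]}: {average(inner, outer):.1f}")
```

These are plain averages over the whole corpus; they say nothing about how the sizes are distributed across individual documents.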

Morphological annotation

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
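A tagger of the kind described above can be sketched as follows. This is a minimal illustration, not the tagger used for the corpus: the lexicon entries, the regular expressions, the order of the rules and the fallback to NOUN are all assumptions made for the example, and the names `LEXICON`, `RULES`, `tag_token` and `tag_sentence` are hypothetical.

```python
import re

# Hypothetical lexicon of frequent words mapped to tags.
# The two entries are illustrative only, not the project's lexicon.
LEXICON = {
    "fi": "CONJ",
    "inni": "PRON",
}

# Ordered (pattern, tag) rules; the first pattern matching the whole token wins.
# These patterns are assumptions for the sketch.
RULES = [
    (re.compile(r"\d+(?:[.,]\d+)*"), "NUM"),   # 2016, 1,000, 3.5
    (re.compile(r"[.,;:!?()]+"), "PUNCT"),     # common punctuation marks
    (re.compile(r"[^\w\s]+"), "SYM"),          # any other non-word characters
]

# Assumed fallback tag for tokens matched by neither the lexicon nor a rule.
DEFAULT_TAG = "NOUN"


def tag_token(token: str) -> str:
    """Tag one token: lexicon lookup first, then regex rules, then fallback."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for pattern, tag in RULES:
        if pattern.fullmatch(token):
            return tag
    return DEFAULT_TAG


def tag_sentence(tokens: list) -> list:
    """Return (token, tag) pairs for an already tokenised sentence."""
    return [(token, tag_token(token)) for token in tokens]
```

For example, `tag_sentence(["inni", "fi", "2016", "%", "."])` would pair each token with a tag from the lexicon or the rules. The lexicon is consulted first so that frequent words are never overridden by a pattern.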

Tag-set

The tag-set is based on the POS tags of Universal Dependencies [2].

Tag Description
ADJ adjective
ADV adverb
AUX auxiliary
CONJ conjunction
DET determiner
NOUN noun
NUM numeral
PREP preposition
PRON pronoun
PUNCT punctuation
SYM symbol
VERB verb

Tag frequencies

Nouns are by far the most frequent part of speech in the corpus. Token counts per part-of-speech tag, in descending order:

Part of speech tag Token count
NOUN 2,873,618
PUNCT 637,157
PREP 296,625
PRON 281,346
ADJ 180,454
NUM 166,468
VERB 135,831
CONJ 130,666
SYM 119,514
ADV 111,580
AUX 110,688
DET 47,749
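The counts above can be turned into relative frequencies with a few lines of code. The sketch below copies the figures from the table; the names `TAG_COUNTS`, `TOTAL_TOKENS` and `tag_shares` are chosen here for illustration.

```python
# Token counts per tag, copied from the table above.
TAG_COUNTS = {
    "NOUN": 2_873_618,
    "PUNCT": 637_157,
    "PREP": 296_625,
    "PRON": 281_346,
    "ADJ": 180_454,
    "NUM": 166_468,
    "VERB": 135_831,
    "CONJ": 130_666,
    "SYM": 119_514,
    "ADV": 111_580,
    "AUX": 110_688,
    "DET": 47_749,
}

# Token count reported under "Corpus properties".
TOTAL_TOKENS = 5_091_696


def tag_shares(counts: dict) -> dict:
    """Return each tag's share of all tagged tokens as a fraction of 1."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


if __name__ == "__main__":
    shares = tag_shares(TAG_COUNTS)
    # Print tags from most to least frequent with their percentage share.
    for tag, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{tag:<6}{share:6.1%}")
```

The per-tag counts sum exactly to the reported token count of 5,091,696, so every token carries one of the twelve tags. Nouns account for about 56% of all tokens.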

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16.

References

  • [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
  • [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.
Last modified on Jan 19, 2017, 9:25:34 PM