= D3.2b: An Afaan Oromo corpus, sized 3 million words, morphologically annotated =

== Building the Oromo Web corpus ==

The building of the corpus is described at [[OromoCorpus]].

== Corpus properties ==

Basic properties of the corpus sources are summarised below.

The size of corpus structures:
||=Document count =|| 8,851||
||=Paragraph count =|| 76,115||
||=Sentence count =|| 250,432||
||=Token count =|| 5,091,696||
||=Latin script lexicon size =|| 273,056||

== Morphological annotation ==

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words. Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.

=== Tag-set ===

The tag-set is based on the POS tags of Universal Dependencies [2].
||=Tag=||=Description=||
||ADJ|| adjective ||
||ADV|| adverb ||
||AUX|| auxiliary ||
||CONJ|| conjunction ||
||DET|| determiner ||
||NOUN|| noun ||
||NUM|| numeral ||
||PREP|| preposition ||
||PRON|| pronoun ||
||PUNCT|| punctuation ||
||SYM|| symbol ||
||VERB|| verb ||

=== Tag frequencies ===

Nouns are by far the most frequent part of speech in the corpus, accounting for about 56% of all tokens.

The most frequent part of speech tags:
||=Part of speech tag =||=Token count =||
||NOUN || 2,873,618||
||PUNCT || 637,157||
||PREP || 296,625||
||PRON || 281,346||
||ADJ || 180,454||
||NUM || 166,468||
||VERB || 135,831||
||CONJ || 130,666||
||SYM || 119,514||
||ADV || 111,580||
||AUX || 110,688||
||DET || 47,749||

== Corpus query interface ==

The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. The corpus can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16.

== References ==

 - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
 - [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.
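
As a quick consistency check on the tag frequency table, the following minimal sketch sums the per-tag token counts, compares the sum with the token count reported under corpus properties, and prints each tag's share of the corpus. The names `TAG_COUNTS`, `REPORTED_TOKEN_COUNT` and `tag_shares` are illustrative only and are not part of the HaBiT tooling; the numbers are copied from the tables in this document.

```python
# Token counts per POS tag, copied from the "Tag frequencies" table.
TAG_COUNTS = {
    "NOUN": 2_873_618,
    "PUNCT": 637_157,
    "PREP": 296_625,
    "PRON": 281_346,
    "ADJ": 180_454,
    "NUM": 166_468,
    "VERB": 135_831,
    "CONJ": 130_666,
    "SYM": 119_514,
    "ADV": 111_580,
    "AUX": 110_688,
    "DET": 47_749,
}

# Token count copied from the "Corpus properties" table.
REPORTED_TOKEN_COUNT = 5_091_696


def tag_shares(counts):
    """Return each tag's share of all tagged tokens, as a percentage."""
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.items()}


if __name__ == "__main__":
    total = sum(TAG_COUNTS.values())
    print(f"sum of tag counts: {total:,}")
    print(f"equals reported token count: {total == REPORTED_TOKEN_COUNT}")
    # List tags from most to least frequent.
    for tag, share in sorted(tag_shares(TAG_COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{tag:<6}{share:6.2f}%")
```

The per-tag counts add up to exactly the reported token count, which suggests that every token in the corpus received one of the twelve tags.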