= D3.4b: A Somali corpus, sized 10 million words, morphologically annotated =

== Building the Somali Web corpus ==

The building of the corpus is described at [[SomaliCorpus]].

== Corpus properties ==

The basic properties of the corpus are summarised below.

The size of corpus structures:
||=Document count =|| 385,338||
||=Paragraph count =|| 1,937,758||
||=Sentence count =|| 2,643,336||
||=Token count =|| 79,741,231||
||=Lexicon size =|| 1,399,350||

== Morphological annotation ==

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words. Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.

=== Tag-set ===

The tag-set is based on the POS tags of Universal Dependencies [2].

||=Tag=||=Description=||
||ADJ|| adjective ||
||ADP|| adposition ||
||ADV|| adverb ||
||AUX|| auxiliary ||
||CONJ|| conjunction ||
||DET|| determiner ||
||NOUN|| noun ||
||NUM|| numeral ||
||PRON|| pronoun ||
||PUNCT|| punctuation ||
||SYM|| symbol ||
||VERB|| verb ||

=== Tag frequencies ===

Nouns are by far the most frequent part of speech in the corpus: the NOUN tag covers more than half of all tokens.

The most frequent part of speech tags:
||=Part of speech tag =||=Token count =||
||NOUN|| 42,711,639||
||PRON|| 8,458,321||
||PUNCT|| 6,527,366||
||PREP|| 4,796,940||
||CONJ|| 4,147,836||
||ADP|| 3,352,316||
||DET|| 2,605,149||
||VERB|| 2,483,263||
||ADJ|| 1,466,010||
||NUM|| 1,357,567||
||ADV|| 1,139,631||
||SYM|| 616,690||

== Corpus query interface ==

The corpus has been indexed by the corpus manager and query system Sketch Engine [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.

== References ==

 - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
 - [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.
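
== Appendix: a sketch of lexicon and regular-expression tagging ==

The tagger mentioned under "Morphological annotation" is only described as combining regular expressions with a lexicon of frequent words. The following is a minimal sketch of how such a tagger might be organised, not the project's actual implementation. The names `LEXICON`, `RULES`, `FALLBACK_TAG`, `tag_token` and `tag_tokens`, the two lexicon entries, the patterns, the rule order and the choice of NOUN as the fallback tag are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon of frequent words. The two entries are illustrative
# placeholders, not taken from the project's actual lexicon.
LEXICON = {
    "iyo": "CONJ",
    "oo": "CONJ",
}

# Hypothetical pattern rules, tried in order; the first full match wins.
RULES = [
    (re.compile(r"\d+([.,]\d+)*"), "NUM"),       # 7, 2016, 79,741,231
    (re.compile(r"[.,;:!?()\[\]-]+"), "PUNCT"),  # runs of punctuation marks
    (re.compile(r"[%$@#&+=*/<>^~|]+"), "SYM"),   # runs of symbols
]

# Assumption: anything matched by neither the lexicon nor a rule gets this tag.
FALLBACK_TAG = "NOUN"


def tag_token(token):
    """Tag one token: lexicon lookup first, then the regex rules."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, rule_tag in RULES:
        if pattern.fullmatch(token):
            return rule_tag
    return FALLBACK_TAG


def tag_tokens(tokens):
    """Tag a list of tokens, returning (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]
```

Calling `tag_tokens(["iyo", "7", "!"])` on an already tokenised sentence returns one `(token, tag)` pair per token. Looking up the lexicon before the patterns keeps frequent function words from being caught by a more general rule; a real tagger of this kind would need a far larger lexicon and language-specific patterns.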