| 2 | |
| 3 | == Building the Somali Web corpus == |
| 4 | |
| 5 | The building of the corpus is described at [[SomaliCorpus]]. |
| 6 | |
| 7 | == Corpus properties == |
| 8 | Basic properties of corpus sources are summarised below. |
| 9 | |
| 10 | The size of corpus structures: |
| 11 | ||=Document count =|| 385,338|| |
| 12 | ||=Paragraph count =|| 1,937,758|| |
| 13 | ||=Sentence count =|| 2,643,336|| |
| 14 | ||=Token count =|| 79,741,231|| |
| 15 | ||=Lexicon size =|| 1,399,350|| |
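The counts above can be turned into rough per-unit averages. The following is a minimal sketch that only restates the figures from the table; the dictionary keys and the `average` helper are illustrative names, not part of any corpus tooling.

```python
# Structure sizes copied from the table above.
CORPUS_SIZES = {
    "documents": 385_338,
    "paragraphs": 1_937_758,
    "sentences": 2_643_336,
    "tokens": 79_741_231,
}


def average(unit: str, per: str) -> float:
    """Return the mean number of `unit` items per `per` item."""
    return CORPUS_SIZES[unit] / CORPUS_SIZES[per]


if __name__ == "__main__":
    print(f"tokens per sentence:     {average('tokens', 'sentences'):.1f}")
    print(f"tokens per document:     {average('tokens', 'documents'):.1f}")
    print(f"sentences per paragraph: {average('sentences', 'paragraphs'):.2f}")
```

These are plain means over the whole corpus and say nothing about how document or sentence lengths are distributed.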
| 16 | |
| 17 | |
| 18 | == Morphological annotation == |
| 19 | |
| 20 | The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words. |
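To illustrate the general idea of a lexicon-plus-regular-expression tagger, here is a minimal sketch. It is not the tagger used for this corpus: the lexicon entry, the patterns, the lookup order and the fallback to `NOUN` are all assumptions made for the example.

```python
import re

# Hypothetical lexicon of frequent words (the real lexicon is not shown on this page).
LEXICON = {
    "iyo": "CONJ",  # Somali "and"
}

# Hypothetical regex rules, tried in order after the lexicon lookup fails.
RULES = [
    (re.compile(r"^\d+([.,]\d+)*$"), "NUM"),     # 2016, 1,000, 3.5
    (re.compile(r"^[.,;:!?()\-]+$"), "PUNCT"),   # punctuation marks
    (re.compile(r"^[%$+=<>@#&*/]+$"), "SYM"),    # other symbols
]

# Assumption: tokens matched by nothing else get a default tag.
DEFAULT_TAG = "NOUN"


def tag_token(token: str) -> str:
    """Return a POS tag for one token: lexicon first, then regex rules, then default."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, rule_tag in RULES:
        if pattern.match(token):
            return rule_tag
    return DEFAULT_TAG


def tag_sentence(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag every token of an already tokenised sentence."""
    return [(token, tag_token(token)) for token in tokens]
```

A tagger of this kind looks at each token in isolation, which is one reason a manual accuracy evaluation is useful.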
| 21 | |
| 22 | Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project. |
| 23 | |
| 24 | |
| 25 | === Tag-set === |
| 26 | |
| 27 | The tag-set is based on the POS tags of Universal Dependencies [2]. |
| 28 | |
| 29 | ||=Tag=||=Description=|| |
| 30 | ||ADJ|| adjective || |
| 31 | ||ADP|| adposition || |
| 32 | ||ADV|| adverb || |
| 33 | ||AUX|| auxiliary || |
| 34 | ||CONJ|| conjunction || |
| 35 | ||DET|| determiner || |
| 36 | ||NOUN|| noun || |
| 37 | ||NUM|| numeral || |
| 38 | ||PRON|| pronoun || |
| 39 | ||PUNCT|| punctuation || |
| 40 | ||SYM|| symbol || |
| 41 | ||VERB|| verb || |
| 42 | |
| 43 | === Tag frequencies === |
| 44 | |
| 45 | Nouns are the most frequent part of speech in the corpus. Token counts for the most frequent part-of-speech tags: |
| 46 | ||=Part of speech tag =||=Token count =|| |
| 47 | ||NOUN|| 42,711,639|| |
| 48 | ||PRON|| 8,458,321|| |
| 49 | ||PUNCT|| 6,527,366|| |
| 50 | ||PREP|| 4,796,940|| |
| 51 | ||CONJ|| 4,147,836|| |
| 52 | ||ADP|| 3,352,316|| |
| 53 | ||DET|| 2,605,149|| |
| 54 | ||VERB|| 2,483,263|| |
| 55 | ||ADJ|| 1,466,010|| |
| 56 | ||NUM|| 1,357,567|| |
| 57 | ||ADV|| 1,139,631|| |
| 58 | ||SYM|| 616,690|| |
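The counts are easier to compare as shares of the whole corpus. The sketch below divides each count from the table by the total token count given under "Corpus properties"; the variable and function names are illustrative only.

```python
# Total token count from the "Corpus properties" table.
TOKEN_COUNT = 79_741_231

# Per-tag token counts copied from the table above.
TAG_COUNTS = {
    "NOUN": 42_711_639,
    "PRON": 8_458_321,
    "PUNCT": 6_527_366,
    "PREP": 4_796_940,
    "CONJ": 4_147_836,
    "ADP": 3_352_316,
    "DET": 2_605_149,
    "VERB": 2_483_263,
    "ADJ": 1_466_010,
    "NUM": 1_357_567,
    "ADV": 1_139_631,
    "SYM": 616_690,
}


def tag_shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return each tag's share of all tokens as a fraction between 0 and 1."""
    return {tag: count / total for tag, count in counts.items()}


if __name__ == "__main__":
    shares = tag_shares(TAG_COUNTS, TOKEN_COUNT)
    for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        print(f"{tag:<6}{share:8.2%}")
```

With these figures, nouns account for slightly more than half of all tokens. The listed counts sum to a little less than the total token count, so the shares do not add up to exactly 100%.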
| 59 | |
| 60 | |
| 61 | == Corpus query interface == |
| 62 | The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16. |
| 63 | |
| 64 | == References == |
| 65 | - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36. |
| 66 | - [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016. |