== Building the Somali Web corpus ==

The building of the corpus is described at [[SomaliCorpus]].

== Corpus properties ==
Basic properties of corpus sources are summarised below.

The size of corpus structures:
||=Document count =|| 385,338||
||=Paragraph count =|| 1,937,758||
||=Sentence count =|| 2,643,336||
||=Token count =|| 79,741,231||
||=Lexicon size =|| 1,399,350||
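
The counts in the table can be combined into average unit sizes. The following is a minimal sketch that only does arithmetic on the figures above; the variable and function names are illustrative and not part of any corpus tooling.

```python
# Counts copied from the corpus size table above.
counts = {
    "documents": 385_338,
    "paragraphs": 1_937_758,
    "sentences": 2_643_336,
    "tokens": 79_741_231,
    "lexicon": 1_399_350,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


tokens_per_sentence = ratio("tokens", "sentences")
tokens_per_document = ratio("tokens", "documents")
sentences_per_paragraph = ratio("sentences", "paragraphs")
paragraphs_per_document = ratio("paragraphs", "documents")

print(f"tokens per sentence:     {tokens_per_sentence:.1f}")
print(f"tokens per document:     {tokens_per_document:.1f}")
print(f"sentences per paragraph: {sentences_per_paragraph:.2f}")
print(f"paragraphs per document: {paragraphs_per_document:.2f}")
```

These are plain averages over the whole corpus; they say nothing about how document or sentence lengths are distributed.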

== Morphological annotation ==

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
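
To illustrate the general idea of a tagger that combines a word lexicon with regular expressions, here is a minimal sketch. It is not the project's tagger: the lexicon entries, the patterns, the rule order and the `NOUN` fallback are all assumptions made for this example.

```python
import re

# Hypothetical mini-lexicon of frequent words. These entries are
# illustrative only and are not taken from the project's word list.
LEXICON = {
    "iyo": "CONJ",
    "ama": "CONJ",
}

# Hypothetical pattern rules, tried in order; the first full match wins.
RULES = [
    (re.compile(r"\d+([.,]\d+)*"), "NUM"),       # 2016, 1,5, 3.14
    (re.compile(r"[.,;:!?()\[\]-]+"), "PUNCT"),  # common punctuation marks
    (re.compile(r"[^\w\s]+"), "SYM"),            # any other non-word characters
]

# Assumption: unknown words fall back to the open class NOUN.
DEFAULT_TAG = "NOUN"


def tag_token(token: str) -> str:
    """Return a tag for one token: lexicon first, then patterns, then default."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    for pattern, tag in RULES:
        if pattern.fullmatch(token):
            return tag
    return DEFAULT_TAG


def tag_sentence(tokens):
    """Tag a pre-tokenised sentence, returning (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]


print(tag_sentence(["Cali", "iyo", "Maryan", "."]))
```

The lexicon is consulted before the patterns so that frequent words get a fixed tag regardless of their shape; a real tagger of this kind would need far larger resources than shown here.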

A manual evaluation of the tagging accuracy is planned for the next phase of the HaBiT project.

=== Tag-set ===

The tag-set is based on the POS tags of Universal Dependencies [2].

||=Tag=||=Description=||
||ADJ|| adjective ||
||ADP|| adposition ||
||ADV|| adverb ||
||AUX|| auxiliary ||
||CONJ|| conjunction ||
||DET|| determiner ||
||NOUN|| noun ||
||NUM|| numeral ||
||PRON|| pronoun ||
||PUNCT|| punctuation ||
||SYM|| symbol ||
||VERB|| verb ||

=== Tag frequencies ===

Nouns are by far the most frequent part of speech in the corpus. Token counts for the most frequent part-of-speech tags:
||=Part of speech tag =||=Token count =||
||NOUN|| 42,711,639||
||PRON|| 8,458,321||
||PUNCT|| 6,527,366||
||PREP|| 4,796,940||
||CONJ|| 4,147,836||
||ADP|| 3,352,316||
||DET|| 2,605,149||
||VERB|| 2,483,263||
||ADJ|| 1,466,010||
||NUM|| 1,357,567||
||ADV|| 1,139,631||
||SYM|| 616,690||
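
The absolute counts are easier to compare as shares of the whole corpus. The sketch below divides each count from the table by the token count given under Corpus properties; the names are illustrative, and the only inputs are the numbers already listed on this page.

```python
# Token count from the corpus size table.
TOTAL_TOKENS = 79_741_231

# Counts copied from the tag frequency table above.
tag_counts = {
    "NOUN": 42_711_639,
    "PRON": 8_458_321,
    "PUNCT": 6_527_366,
    "PREP": 4_796_940,
    "CONJ": 4_147_836,
    "ADP": 3_352_316,
    "DET": 2_605_149,
    "VERB": 2_483_263,
    "ADJ": 1_466_010,
    "NUM": 1_357_567,
    "ADV": 1_139_631,
    "SYM": 616_690,
}

# Share of all corpus tokens carried by each listed tag.
shares = {tag: count / TOTAL_TOKENS for tag, count in tag_counts.items()}

# Tokens covered by the listed tags, to compare against the corpus total.
listed_total = sum(tag_counts.values())

for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:<6}{share:7.1%}")
print(f"listed tags cover {listed_total / TOTAL_TOKENS:.1%} of all tokens")
```

By this calculation nouns account for slightly more than half of all tokens, and the listed tags together cover nearly, but not exactly, the full token count.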

== Corpus query interface ==
The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. The corpus can be searched at: http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16
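
The search address consists of a base path and a `corpname` query parameter. A minimal sketch of building it programmatically follows; the function name is hypothetical, and only the `sowac16` value is confirmed by this page, so whether other corpus names are served at the same address is an assumption.

```python
from urllib.parse import urlencode

# Base address of the search form, as given on this page.
BASE_URL = "http://corpora.fi.muni.cz/habit/run.cgi/first_form"


def search_form_url(corpname: str) -> str:
    """Build the search form address for a corpus name (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'corpname': corpname})}"


print(search_form_url("sowac16"))
```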

== References ==
- [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
- [2] -- Nivre, Joakim, et al. "Universal Dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.