Changes between Version 1 and Version 2 of SomaliCorpusMorpho

Timestamp: Jan 19, 2017, 2:36:47 AM
Author: pary
Comment: --

SomaliCorpusMorpho

-              v1
+              v2
 = D3.4b: A Somali corpus, sized 10 million words, morphologically annotated =
+== Building the Somali Web corpus ==
+The building of the corpus is described at [[SomaliCorpus]].
+== Corpus properties ==
+The basic properties of the corpus are summarised below.
+Sizes of the corpus structures:
+||=Document count    =||    385,338||
+||=Paragraph count   =||  1,937,758||
+||=Sentence count    =||  2,643,336||
+||=Token count       =|| 79,741,231||
+||=Lexicon size      =||  1,399,350||
+== Morphological annotation ==
+The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
+Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
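The tagger itself is not shown on this page. As a rough illustration of the general idea only (a lexicon lookup first, then regular-expression rules), a minimal sketch might look like the following. Everything in it is an assumption made for illustration: the lexicon entries, the patterns, the fallback tag, and the names `LEXICON`, `PATTERN_RULES`, `DEFAULT_TAG`, `tag_token` and `tag_tokens` are hypothetical and do not come from the HaBiT project.

```python
import re

# Hypothetical lexicon of frequent words mapped to tags from the tag-set.
# The entries are illustrative placeholders, not the project's lexicon.
LEXICON = {
    "iyo": "CONJ",
    "ama": "CONJ",
    "ka": "ADP",
}

# Hypothetical regular-expression rules, tried in order after the lexicon.
PATTERN_RULES = [
    (re.compile(r"\d+(?:[.,]\d+)*"), "NUM"),   # digits with optional separators
    (re.compile(r"[.,;:!?()'-]+"), "PUNCT"),   # common punctuation marks
    (re.compile(r"[^\w\s]+"), "SYM"),          # any other non-word characters
]

# Assumed fallback for tokens that match nothing above.
DEFAULT_TAG = "NOUN"


def tag_token(token):
    """Return a tag for one token: lexicon first, then patterns, then fallback."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, rule_tag in PATTERN_RULES:
        if pattern.fullmatch(token):
            return rule_tag
    return DEFAULT_TAG


def tag_tokens(tokens):
    """Tag a list of tokens, returning (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]
```

Trying the lexicon before the patterns lets frequent words keep their listed tag even if a pattern would also match them. The fallback tag is a design choice of this sketch, not a documented property of the actual tagger.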
+=== Tag-set ===
+The tag-set is based on the POS tags of Universal Dependencies [2].
+||=Tag=||=Description=||
+||ADJ|| adjective ||
+||ADP|| adposition ||
+||ADV|| adverb ||
+||AUX|| auxiliary ||
+||CONJ|| conjunction ||
+||DET|| determiner ||
+||NOUN|| noun ||
+||NUM|| numeral ||
+||PRON|| pronoun ||
+||PUNCT|| punctuation ||
+||SYM|| symbol ||
+||VERB|| verb ||
+=== Tag frequencies ===
+Nouns are the most frequent part of speech in the corpus. Token counts by part-of-speech tag:
+||=Part of speech tag =||=Token count =||
+||NOUN|| 42,711,639||
+||PRON|| 8,458,321||
+||PUNCT|| 6,527,366||
+||PREP|| 4,796,940||
+||CONJ|| 4,147,836||
+||ADP|| 3,352,316||
+||DET|| 2,605,149||
+||VERB|| 2,483,263||
+||ADJ|| 1,466,010||
+||NUM|| 1,357,567||
+||ADV|| 1,139,631||
+||SYM|| 616,690||
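To put the counts in proportion, each tag's share of the corpus can be computed from the table above and the token count given under corpus properties. The sketch below copies those figures as they are listed; the names `TOKEN_COUNT`, `TAG_COUNTS` and `tag_shares` are chosen here for illustration.

```python
# Token count from the corpus properties table.
TOKEN_COUNT = 79_741_231

# Per-tag token counts, copied from the tag frequency table as listed.
TAG_COUNTS = {
    "NOUN": 42_711_639,
    "PRON": 8_458_321,
    "PUNCT": 6_527_366,
    "PREP": 4_796_940,
    "CONJ": 4_147_836,
    "ADP": 3_352_316,
    "DET": 2_605_149,
    "VERB": 2_483_263,
    "ADJ": 1_466_010,
    "NUM": 1_357_567,
    "ADV": 1_139_631,
    "SYM": 616_690,
}


def tag_shares(counts, total):
    """Return each tag's fraction of the total token count."""
    return {tag: count / total for tag, count in counts.items()}


shares = tag_shares(TAG_COUNTS, TOKEN_COUNT)
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:<6}{share:7.2%}")

# Tokens not accounted for by the tags listed in the table.
unlisted = TOKEN_COUNT - sum(TAG_COUNTS.values())
print(f"Tokens outside the listed tags: {unlisted:,}")
```

By these figures, nouns alone account for a little over half of all tokens, and the listed tags together cover slightly less than the full token count.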
+== Corpus query interface ==
+The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.
+== References ==
+ - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
+ - [2] -- Nivre, Joakim, et al. "Universal Dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.