Changes between Version 1 and Version 2 of SomaliCorpusMorpho


Timestamp:
Jan 19, 2017, 2:36:47 AM (7 years ago)
Author:
pary

  • SomaliCorpusMorpho

    v1 v2  
    11= D3.4b: A Somali corpus, sized 10 million words, morphologically annotated =
     2
     3== Building the Somali Web corpus ==
     4
     5The building of the corpus is described at [[SomaliCorpus]].
     6
     7== Corpus properties ==
     8Basic properties of corpus sources are summarised below.
     9
     10The size of corpus structures:
     11||=Document count    =||    385,338||
     12||=Paragraph count   =||  1,937,758||
     13||=Sentence count    =||  2,643,336||
     14||=Token count       =|| 79,741,231||
     15||=Lexicon size =||  1,399,350||
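
A few derived ratios make the table easier to interpret. The sketch below only divides the counts listed above; the variable names are illustrative and not part of any corpus tooling.

```python
# Counts copied from the corpus structure table above.
counts = {
    "documents": 385_338,
    "paragraphs": 1_937_758,
    "sentences": 2_643_336,
    "tokens": 79_741_231,
    "lexicon": 1_399_350,
}


def ratio(numerator, denominator):
    """Return the average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


tokens_per_sentence = ratio("tokens", "sentences")
tokens_per_document = ratio("tokens", "documents")
sentences_per_paragraph = ratio("sentences", "paragraphs")
tokens_per_lexicon_entry = ratio("tokens", "lexicon")

print(f"tokens per sentence:      {tokens_per_sentence:.1f}")
print(f"tokens per document:      {tokens_per_document:.1f}")
print(f"sentences per paragraph:  {sentences_per_paragraph:.2f}")
print(f"tokens per lexicon entry: {tokens_per_lexicon_entry:.1f}")
```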
     16
     17
     18== Morphological annotation ==
     19
     20The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
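
The general idea of a lexicon-plus-regular-expression tagger can be sketched as follows. This is a minimal illustration and not the tagger used for the corpus: the one-entry lexicon, the three rules, the rule order, and the `NOUN` fallback are all assumptions made for the example.

```python
import re

# Hypothetical lexicon of frequent word forms (illustrative only);
# a real lexicon would be far larger.
LEXICON = {
    "iyo": "CONJ",  # "and"
}

# Hypothetical rules, tried in order; the first full match wins.
RULES = [
    (re.compile(r"\d+([.,]\d+)*"), "NUM"),    # 2017, 1,5, 10.000
    (re.compile(r"[.,;:!?()\-]+"), "PUNCT"),  # common punctuation marks
    (re.compile(r"[^\w\s]+"), "SYM"),         # any other non-word symbols
]

# Assumption: tokens matched by neither the lexicon nor a rule get this tag.
FALLBACK_TAG = "NOUN"


def tag_token(token):
    """Look the token up in the lexicon, then try each rule, then fall back."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, rule_tag in RULES:
        if pattern.fullmatch(token):
            return rule_tag
    return FALLBACK_TAG


def tag_tokens(tokens):
    """Tag an already tokenised sentence; returns (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]


print(tag_tokens(["Iyo", "2017", ",", "%", "xyz"]))
```

The lexicon is consulted first so that frequent words receive a fixed tag regardless of their shape; the regular expressions then cover open-ended classes such as numbers and punctuation.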
     21
     22Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
     23
     24
     25=== Tag-set ===
     26
     27The tag-set is based on the POS tags of Universal Dependencies [2].
     28
     29||=Tag=||=Description=||
     30||ADJ|| adjective ||
     31||ADP|| adposition ||
     32||ADV|| adverb ||
     33||AUX|| auxiliary ||
     34||CONJ|| conjunction ||
     35||DET|| determiner ||
     36||NOUN|| noun ||
     37||NUM|| numeral ||
     38||PRON|| pronoun ||
     39||PUNCT|| punctuation ||
     40||SYM|| symbol ||
     41||VERB|| verb ||
     42
     43=== Tag frequencies ===
     44
     45Nouns are the most frequent part of speech in the corpus. Token counts by part-of-speech tag:
     46||=Part of speech tag =||=Token count =||
     47||NOUN|| 42,711,639||
     48||PRON|| 8,458,321||
     49||PUNCT|| 6,527,366||
     50||PREP|| 4,796,940||
     51||CONJ|| 4,147,836||
     52||ADP|| 3,352,316||
     53||DET|| 2,605,149||
     54||VERB|| 2,483,263||
     55||ADJ|| 1,466,010||
     56||NUM|| 1,357,567||
     57||ADV|| 1,139,631||
     58||SYM|| 616,690||
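
To read the frequencies as proportions, each count can be divided by the corpus token count given under Corpus properties. The sketch below does that and also checks how many tokens the listed tags account for; the counts and tag names are copied from the table as given (including `PREP`), and the variable names are illustrative.

```python
# Total token count from the corpus properties table.
TOKEN_COUNT = 79_741_231

# Per-tag token counts, copied from the frequency table above.
TAG_COUNTS = {
    "NOUN": 42_711_639,
    "PRON": 8_458_321,
    "PUNCT": 6_527_366,
    "PREP": 4_796_940,
    "CONJ": 4_147_836,
    "ADP": 3_352_316,
    "DET": 2_605_149,
    "VERB": 2_483_263,
    "ADJ": 1_466_010,
    "NUM": 1_357_567,
    "ADV": 1_139_631,
    "SYM": 616_690,
}

# Share of the whole corpus carried by each listed tag.
shares = {tag: count / TOKEN_COUNT for tag, count in TAG_COUNTS.items()}

# Tokens covered by the listed tags, and the remainder not in the table.
listed_total = sum(TAG_COUNTS.values())
unlisted = TOKEN_COUNT - listed_total

for tag, count in sorted(TAG_COUNTS.items(), key=lambda item: -item[1]):
    print(f"{tag:<6}{count:>12,}{shares[tag]:>8.2%}")
print(f"listed tags cover {listed_total:,} of {TOKEN_COUNT:,} tokens")
print(f"tokens not covered by the table: {unlisted:,}")
```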
     59
     60
     61== Corpus query interface ==
     62The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.
     63
     64== References ==
     65 - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
     66 - [2] -- Nivre, Joakim, et al. "Universal Dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.