Changes between Version 1 and Version 2 of OromoCorpusMorpho


Timestamp: Jan 19, 2017, 8:38:30 AM
Author: pary
= D3.2b: An Afaan Oromo corpus, sized 3 million words, morphologically annotated =

== Building the Oromo Web corpus ==

The building of the corpus is described at [[OromoCorpus]].

== Corpus properties ==
Basic properties of the corpus sources are summarised below.

The size of corpus structures:
||=Document count    =||     8,851||
||=Paragraph count   =||    76,115||
||=Sentence count    =||   250,432||
||=Token count       =|| 5,091,696||
||=Latin script lexicon size =||   273,056||
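
Counts of this kind could be derived along the following lines. This is only a sketch under stated assumptions, not the procedure used for the figures above: it assumes the corpus is stored in a vertical format (one token per line, tab-separated attributes) with `<doc>`, `<p>` and `<s>` structure tags, and it approximates the lexicon as the set of distinct lowercased word forms without reproducing the Latin script filter. The function name `corpus_structure_counts` is hypothetical.

```python
import re

# Matches structure lines such as <doc id="1">, </s> or <g/> (assumed markup).
STRUCTURE_TAG = re.compile(r"^<(/?)(\w+)(?:\s[^>]*)?/?>$")

# Assumed structure names; the real corpus may use different ones.
STRUCTURE_NAMES = {"doc": "documents", "p": "paragraphs", "s": "sentences"}


def corpus_structure_counts(vertical_lines):
    """Count documents, paragraphs, sentences, tokens and distinct word forms."""
    counts = {"documents": 0, "paragraphs": 0, "sentences": 0, "tokens": 0}
    lexicon = set()
    for raw in vertical_lines:
        line = raw.strip()
        if not line:
            continue
        match = STRUCTURE_TAG.match(line)
        if match:
            is_closing, name = match.group(1), match.group(2)
            # Count only opening tags of the structures we track.
            if not is_closing and name in STRUCTURE_NAMES:
                counts[STRUCTURE_NAMES[name]] += 1
            continue
        # Any other line is a token; the word form is the first column.
        word = line.split("\t")[0]
        counts["tokens"] += 1
        lexicon.add(word.lower())
    counts["lexicon_size"] = len(lexicon)
    return counts


sample = [
    '<doc id="1">', "<p>", "<s>",
    "Afaan\tNOUN", "Oromoo\tNOUN", ".\tPUNCT",
    "</s>", "<s>", "afaan\tNOUN", "</s>", "</p>", "</doc>",
]
print(corpus_structure_counts(sample))
```

The sample lines are invented for illustration and are not taken from the corpus.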

== Morphological annotation ==

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
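
The general idea of a tagger that combines a frequent-word lexicon with regular expressions can be sketched as follows. This is a minimal illustration and not the project's tagger: the lexicon entries, the patterns, the lookup order and the `NOUN` fallback are all assumptions made for the example, and the names `LEXICON`, `PATTERNS`, `tag_token` and `tag_sentence` are hypothetical.

```python
import re

# Hypothetical lexicon of frequent words; the real lexicon is not given here.
LEXICON = {
    "fi": "CONJ",
    "kana": "DET",
    "isaan": "PRON",
    "keessa": "ADP",
}

# Illustrative, language-independent patterns, tried in order.
PATTERNS = [
    (re.compile(r"^\d+(?:[.,]\d+)*$"), "NUM"),        # 12, 3,000, 4.5
    (re.compile(r"^[.,;:!?()\[\]\"'-]+$"), "PUNCT"),  # common punctuation
    (re.compile(r"^[^\w\s]+$"), "SYM"),               # other non-word symbols
]

# Assumed fallback for tokens that match neither the lexicon nor a pattern.
DEFAULT_TAG = "NOUN"


def tag_token(token):
    """Return a tag for one token: lexicon first, then patterns, then default."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, pattern_tag in PATTERNS:
        if pattern.match(token):
            return pattern_tag
    return DEFAULT_TAG


def tag_sentence(tokens):
    """Tag a list of tokens and return (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]


print(tag_sentence(["Isaan", "fi", "3,000", "%", ".", "mana"]))
```

Looking up the lexicon before the patterns lets frequent closed-class words override any pattern that would otherwise match them. A real rule set would need language-specific patterns and an evaluated fallback strategy.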


=== Tag-set ===

The tag-set is based on the POS tags of Universal Dependencies [2].

||=Tag=||=Description=||
||ADJ|| adjective ||
||ADP|| adposition ||
||ADV|| adverb ||
||AUX|| auxiliary ||
||CONJ|| conjunction ||
||DET|| determiner ||
||NOUN|| noun ||
||NUM|| numeral ||
||PRON|| pronoun ||
||PUNCT|| punctuation ||
||SYM|| symbol ||
||VERB|| verb ||

=== Tag frequencies ===

Nouns are the most frequent part of speech in the corpus. Token counts by part-of-speech tag, in descending order:
||=Part of speech tag =||=Token count =||
||NOUN|| 42,711,639||
||PRON|| 8,458,321||
||PUNCT|| 6,527,366||
||PREP|| 4,796,940||
||CONJ|| 4,147,836||
||ADP|| 3,352,316||
||DET|| 2,605,149||
||VERB|| 2,483,263||
||ADJ|| 1,466,010||
||NUM|| 1,357,567||
||ADV|| 1,139,631||
||SYM|| 616,690||
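
A frequency table of this shape could be produced from a tagged corpus roughly as shown below. The sketch assumes a vertical file with the word form in the first tab-separated column and the tag in the second, which is an assumption about the storage format rather than something stated here. The function names `tag_frequencies` and `as_wiki_table` are hypothetical, and the sample data is invented.

```python
from collections import Counter


def tag_frequencies(vertical_lines, tag_column=1):
    """Return (tag, count) pairs sorted by descending count."""
    counts = Counter()
    for raw in vertical_lines:
        line = raw.rstrip("\n")
        # Skip empty lines and structure markup such as <s> or </doc>.
        if not line.strip() or (line.startswith("<") and line.endswith(">")):
            continue
        columns = line.split("\t")
        if len(columns) > tag_column:
            counts[columns[tag_column]] += 1
    return counts.most_common()


def as_wiki_table(rows):
    """Render (tag, count) pairs as a wiki table with thousands separators."""
    lines = ["||=Part of speech tag =||=Token count =||"]
    lines += [f"||{tag}|| {count:,}||" for tag, count in rows]
    return "\n".join(lines)


sample = [
    "<s>",
    "mana\tNOUN", "barumsaa\tNOUN", "fi\tCONJ", "barattoota\tNOUN", ".\tPUNCT",
    "</s>",
    "<s>",
    "fi\tCONJ",
    "</s>",
]
print(as_wiki_table(tag_frequencies(sample)))
```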


== Corpus query interface ==
The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at: http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16

== References ==
 - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
 - [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.