Changes between Version 1 and Version 2 of TigrinyaCorpusMorpho


Timestamp:
Jan 19, 2017, 2:05:58 AM (8 years ago)
Author:
pary

  • TigrinyaCorpusMorpho

    v1 v2  
    11= D3.3b: A Tigrinya corpus, sized 3 million words, morphologically annotated =
     2
     3== Building the Tigrinya Web corpus ==
     4
     5The building of the corpus is described at [[TigrinyaCorpus]].
     6
     7== Corpus properties ==
     8Basic properties of corpus sources are summarised below.
     9
     10The size of corpus structures:
     11||=Document count    =||     1,907||
     12||=Paragraph count   =||    28,552||
     13||=Sentence count    =||   139,357||
     14||=Token count       =|| 2,531,443||
     15||=Ge'ez script lexicon size =||   225,132||
     16||=Sera transliteration lexicon size  =||   220,935||
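
The figures above allow a few derived ratios to be computed, such as average sentence length and the share of distinct word forms. The following is a minimal sketch using only the numbers from the table; the dictionary keys and the helper name `ratio` are illustrative and not part of the corpus tooling.

```python
# Figures copied from the corpus properties table above.
CORPUS = {
    "documents": 1_907,
    "paragraphs": 28_552,
    "sentences": 139_357,
    "tokens": 2_531_443,
    "geez_lexicon": 225_132,   # distinct word forms in Ge'ez script
    "sera_lexicon": 220_935,   # distinct word forms in Sera transliteration
}


def ratio(numerator: str, denominator: str) -> float:
    """Divide one corpus figure by another (hypothetical helper)."""
    return CORPUS[numerator] / CORPUS[denominator]


paragraphs_per_document = ratio("paragraphs", "documents")
sentences_per_paragraph = ratio("sentences", "paragraphs")
tokens_per_sentence = ratio("tokens", "sentences")
geez_type_token_ratio = ratio("geez_lexicon", "tokens")

print(f"paragraphs per document: {paragraphs_per_document:.2f}")
print(f"sentences per paragraph: {sentences_per_paragraph:.2f}")
print(f"tokens per sentence:     {tokens_per_sentence:.2f}")
print(f"Ge'ez type/token ratio:  {geez_type_token_ratio:.4f}")
```

The Sera lexicon is smaller than the Ge'ez one by 4,197 entries, so some distinct Ge'ez forms evidently share a transliteration.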
     17
     18== Morphological annotation ==
     19
     20The corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].
     21
     22Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
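
Tagger output of this kind is usually post-processed line by line. The sketch below counts tags in such output, assuming one token per line with tab-separated fields (token, tag, optionally lemma) and structural markup such as `<s>` on lines of its own. That layout is an assumption about how the tagger was invoked, not something this page states, and the sample tokens are placeholders.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count POS tags in tab-separated tagger output (assumed layout)."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and structural markup such as <doc>, <p>, <s>.
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no tag column on this line
        counts[fields[1]] += 1
    return counts


# Hypothetical sample; tok1..tok4 stand in for real corpus tokens.
sample = [
    "<s>",
    "tok1\tN\tlemma1",
    "tok2\tNP\tlemma2",
    "tok3\tN\tlemma3",
    "tok4\tPUNC\tlemma4",
    "</s>",
]
tag_counts = count_tags(sample)
print(tag_counts.most_common())
```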
     23
     24
     25=== Tag-set ===
     26
     27||=Basic Class=||=Definition of the tag=||=Tag=||
     28||Noun||Verbal/infinitival noun, formed from any verb|| VN ||
     29|| ||Noun attached with a preposition|| NP ||
     30|| ||Noun attached with conjunction|| NC ||
     31|| ||Noun with a proclitic preposition and an enclitic conjunction|| NPC ||
     32|| ||Any other noun|| N ||
     33||Pronoun||Pronoun attached with preposition|| PRONP ||
     34|| ||Pronoun attached with conjunction|| PRONC ||
     35|| ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC ||
     36|| ||Any other pronoun|| PRON ||
     37||Verb||Auxiliary verb|| AUX ||
     38|| ||Relative verb|| VREL ||
     39|| ||Verb attached with preposition|| VP ||
     40|| ||Verb attached with conjunction|| VC ||
     41|| ||Verb with a proclitic preposition and an enclitic conjunction|| VPC ||
     42|| ||Verb (all other)|| V ||
     43||Adjective||Adjective attached with preposition|| ADJP ||
     44|| ||Adjective attached with conjunction|| ADJC ||
     45|| ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC  ||
     46|| ||Any other adjective|| ADJ ||
     47||Preposition||Preposition|| PREP ||
     48||Conjunction||Conjunction|| CONJ ||
     49||Adverb||Adverb|| ADV ||
     50||Numeral||Cardinal|| NUMCR ||
     51|| ||Ordinal|| NUMOR ||
     52|| ||Numeral attached with preposition|| NUMP ||
     53|| ||Numeral attached with conjunction|| NUMC ||
     54|| ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC ||
     55||Interjection||Interjections|| INT ||
     56||Punctuation||Punctuation|| PUNC ||
     57||Unclassified||Unclassified|| UNC  ||
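
The tag-set follows a regular pattern: five classes (noun, pronoun, verb, adjective, numeral) take the suffixes `P`, `C` and `PC` to mark an attached preposition, an attached conjunction, or both. The following sketch builds a lookup table that splits each tag into a base class and two clitic flags. The names `TagInfo`, `PLAIN_TAGS` and `CLITIC_BASES` are illustrative, and the base label `NUM` is an assumption, since the table has no bare `NUM` tag.

```python
from typing import Dict, NamedTuple


class TagInfo(NamedTuple):
    base: str          # base class label
    preposition: bool  # proclitic preposition attached
    conjunction: bool  # enclitic conjunction attached


# Tags from the table that carry no clitic marking.
PLAIN_TAGS = (
    "VN", "N", "PRON", "AUX", "VREL", "V", "ADJ", "PREP",
    "CONJ", "ADV", "NUMCR", "NUMOR", "INT", "PUNC", "UNC",
)
# Classes that combine with P, C and PC ("NUM" is an assumed label).
CLITIC_BASES = ("N", "PRON", "V", "ADJ", "NUM")


def build_tag_table() -> Dict[str, TagInfo]:
    table = {tag: TagInfo(tag, False, False) for tag in PLAIN_TAGS}
    for base in CLITIC_BASES:
        table[base + "P"] = TagInfo(base, True, False)
        table[base + "C"] = TagInfo(base, False, True)
        table[base + "PC"] = TagInfo(base, True, True)
    return table


TAG_TABLE = build_tag_table()
print(TAG_TABLE["NPC"])
print(TAG_TABLE["NUMCR"])
```

An explicit table is used instead of stripping suffixes from arbitrary strings, because tags such as `PREP`, `PUNC` and `NUMCR` would otherwise be misread as carrying a clitic marker.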
     58
     59=== Tag frequencies ===
     60
     61Nouns are the most frequent part of speech in the corpus. The most frequent part-of-speech tags are:
     62||=Part of speech tag =||=Token count =||
     63||N|| 1,676,460||
     64||PUNC|| 135,685||
     65||NP|| 135,676||
     66||SENT|| 116,574||
     67||V|| 106,615||
     68||NUMCR|| 91,516||
     69||VP|| 62,990||
     70||NC|| 60,589||
     71||ADJ|| 56,009||
     72||VN|| 21,778||
     73||VREL|| 16,530||
     74||CONJ|| 11,953||
     75||ADV|| 10,193||
     76||PREP|| 7,444||
     77||NPC|| 5,954||
     78||UNC|| 5,723||
     79||VPC|| 4,532||
     80||ADJC|| 1,455||
     81||PRON|| 1,219||
     82||ADJPC|| 1,016||
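
To put the counts above in proportion, each can be divided by the corpus token count of 2,531,443 given in the corpus properties. The following is a minimal sketch with the figures copied from this page; the variable names are illustrative.

```python
TOKEN_COUNT = 2_531_443  # token count from the corpus properties table

# Counts copied from the tag frequency table above.
TAG_COUNTS = {
    "N": 1_676_460, "PUNC": 135_685, "NP": 135_676, "SENT": 116_574,
    "V": 106_615, "NUMCR": 91_516, "VP": 62_990, "NC": 60_589,
    "ADJ": 56_009, "VN": 21_778, "VREL": 16_530, "CONJ": 11_953,
    "ADV": 10_193, "PREP": 7_444, "NPC": 5_954, "UNC": 5_723,
    "VPC": 4_532, "ADJC": 1_455, "PRON": 1_219, "ADJPC": 1_016,
}

# Share of all corpus tokens carried by each listed tag.
shares = {tag: count / TOKEN_COUNT for tag, count in TAG_COUNTS.items()}

listed_total = sum(TAG_COUNTS.values())
unlisted = TOKEN_COUNT - listed_total  # tokens with tags not listed above

for tag, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{tag:6s} {share:6.2%}")
print(f"tokens outside the listed tags: {unlisted}")
```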
     83