| 2 | |
| 3 | == Building the Tigrinya Web corpus == |
| 4 | |
| 5 | The building of the corpus is described at [[TigrinyaCorpus]]. |
| 6 | |
| 7 | == Corpus properties == |
| 8 | Basic properties of corpus sources are summarised below. |
| 9 | |
| 10 | The size of corpus structures: |
| 11 | ||=Document count =|| 1,907|| |
| 12 | ||=Paragraph count =|| 28,552|| |
| 13 | ||=Sentence count =|| 139,357|| |
| 14 | ||=Token count =|| 2,531,443|| |
| 15 | ||=Ge'ez script lexicon size =|| 225,132|| |
| 16 | ||=Sera transliteration lexicon size =|| 220,935|| |
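The counts above also give a few derived averages. The following is a minimal sketch that uses only the figures from the table; the names `CORPUS_SIZE` and `average` are illustrative and not part of any corpus tooling.

```python
# Counts copied from the corpus size table above.
CORPUS_SIZE = {
    "documents": 1_907,
    "paragraphs": 28_552,
    "sentences": 139_357,
    "tokens": 2_531_443,
}


def average(unit: str, per: str) -> float:
    """Return the average number of `unit` items per single `per` item."""
    return CORPUS_SIZE[unit] / CORPUS_SIZE[per]


# Print a few ratios that characterise the corpus structure.
for unit, per in [
    ("tokens", "sentences"),
    ("sentences", "paragraphs"),
    ("paragraphs", "documents"),
]:
    # per[:-1] drops the plural "s" so the label reads "per sentence", etc.
    print(f"{unit} per {per[:-1]}: {average(unit, per):.1f}")
```

With these figures an average sentence has about 18 tokens and an average document about 15 paragraphs.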
| 17 | |
| 18 | == Morphological annotation == |
| 19 | |
| 20 | The corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus [1]. |
| 21 | |
| 22 | Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project. |
| 23 | |
| 24 | |
| 25 | === Tag-set === |
| 26 | |
| 27 | ||=Basic Class=||=Definition of the tag=||=Tag=|| |
| 28 | ||Noun||Verbal/infinitival noun, formed from any verb|| VN || |
| 29 | || ||Noun attached with a preposition|| NP || |
| 30 | || ||Noun attached with conjunction|| NC || |
| 31 | || ||Noun with a proclitic preposition and an enclitic conjunction|| NPC || |
| 32 | || ||Any other noun|| N || |
| 33 | ||Pronoun||Pronoun attached with preposition|| PRONP || |
| 34 | || ||Pronoun attached with conjunction|| PRONC || |
| 35 | || ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC || |
| 36 | || ||Any other Pronoun|| PRON || |
| 37 | ||Verb||Auxiliary verb|| AUX || |
| 38 | || ||Relative verb|| VREL || |
| 39 | || ||Verb attached with preposition|| VP || |
| 40 | || ||Verb attached with conjunction|| VC || |
| 41 | || ||Verb with a proclitic preposition and an enclitic conjunction|| VPC || |
| 42 | || ||Verb (all other)|| V || |
| 43 | ||Adjective||Adjective attached with preposition|| ADJP || |
| 44 | || ||Adjective attached with conjunction|| ADJC || |
| 45 | || ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC || |
| 46 | || ||Any other Adjective|| ADJ || |
| 47 | ||Preposition||Preposition|| PREP || |
| 48 | ||Conjunction||Conjunction|| CONJ || |
| 49 | ||Adverb||Adverb|| ADV || |
| 50 | ||Numeral||Cardinal|| NUMCR || |
| 51 | || ||Ordinal|| NUMOR || |
| 52 | || ||Numeral attached with preposition|| NUMP || |
| 53 | || ||Numeral attached with conjunction|| NUMC || |
| 54 | || ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC || |
| 55 | ||Interjection||Interjections|| INT || |
| 56 | ||Punctuation||Punctuation|| PUNC || |
| 57 | ||Unclassified||Unclassified|| UNC || |
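The table follows a regular pattern: for nouns, pronouns, verbs, adjectives and numerals, the suffix `P` marks an attached preposition, `C` an attached conjunction, and `PC` both. The sketch below decomposes a tag along those lines. It is an illustration of the naming scheme only; `split_tag` and `TagInfo` are hypothetical helpers, not part of !TreeTagger or the corpus tools. Tags such as `PUNC`, `UNC` or `NUMCR` end in similar letters but are not clitic forms, so the function matches against an explicit list of base tags instead of stripping suffixes blindly.

```python
from typing import NamedTuple

# Base tags that combine with the clitic suffixes in the tag-set table.
CLITIC_BASES = ("N", "PRON", "V", "ADJ", "NUM")

# Suffix -> (has proclitic preposition, has enclitic conjunction)
SUFFIXES = {"P": (True, False), "C": (False, True), "PC": (True, True)}


class TagInfo(NamedTuple):
    base: str
    preposition: bool
    conjunction: bool


def split_tag(tag: str) -> TagInfo:
    """Split a tag into its base class tag and clitic markers."""
    for base in CLITIC_BASES:
        suffix = tag[len(base):]
        if tag.startswith(base) and suffix in SUFFIXES:
            prep, conj = SUFFIXES[suffix]
            return TagInfo(base, prep, conj)
    # Every other tag (N, VN, VREL, NUMCR, PUNC, ...) is returned unchanged.
    return TagInfo(tag, False, False)


print(split_tag("NPC"))    # noun with preposition and conjunction
print(split_tag("PRONC"))  # pronoun with conjunction
print(split_tag("PUNC"))   # punctuation, not a clitic form
```

Grouping tags by `base` in this way makes it possible to count, for example, all noun tokens regardless of attached clitics.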
| 58 | |
| 59 | === Tag frequencies === |
| 60 | |
| 61 | Nouns are by far the most frequent part of speech in the corpus: the plain noun tag N alone covers roughly two thirds of all tokens. The 20 most frequent part-of-speech tags are: |
| 62 | ||=Part of speech tag =||=Token count =|| |
| 63 | ||N|| 1,676,460|| |
| 64 | ||PUNC|| 135,685|| |
| 65 | ||NP|| 135,676|| |
| 66 | ||SENT|| 116,574|| |
| 67 | ||V|| 106,615|| |
| 68 | ||NUMCR|| 91,516|| |
| 69 | ||VP|| 62,990|| |
| 70 | ||NC|| 60,589|| |
| 71 | ||ADJ|| 56,009|| |
| 72 | ||VN|| 21,778|| |
| 73 | ||VREL|| 16,530|| |
| 74 | ||CONJ|| 11,953|| |
| 75 | ||ADV|| 10,193|| |
| 76 | ||PREP|| 7,444|| |
| 77 | ||NPC|| 5,954|| |
| 78 | ||UNC|| 5,723|| |
| 79 | ||VPC|| 4,532|| |
| 80 | ||ADJC|| 1,455|| |
| 81 | ||PRON|| 1,219|| |
| 82 | ||ADJPC|| 1,016|| |
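To put the absolute counts in proportion, each one can be divided by the total token count given under Corpus properties. The sketch below does this for the first few rows of the table; `TAG_COUNTS` is only an excerpt, and the helper name `share` is illustrative.

```python
# Total token count from the corpus size table.
TOTAL_TOKENS = 2_531_443

# Excerpt of the tag frequency table (top six rows only).
TAG_COUNTS = {
    "N": 1_676_460,
    "PUNC": 135_685,
    "NP": 135_676,
    "SENT": 116_574,
    "V": 106_615,
    "NUMCR": 91_516,
}


def share(tag: str) -> float:
    """Return the fraction of all corpus tokens that carry `tag`."""
    return TAG_COUNTS[tag] / TOTAL_TOKENS


# Print each listed tag with its share of the corpus as a percentage.
for tag in TAG_COUNTS:
    print(f"{tag:6s} {share(tag):6.1%}")
```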
| 83 | |