= D3.3b: A Tigrinya corpus, sized 3 million words, morphologically annotated =

== Building the Tigrinya Web corpus ==

The building of the corpus is described at [[TigrinyaCorpus]].

== Corpus properties ==
Basic properties of corpus sources are summarised below.

The size of corpus structures:
||=Document count    =||     1,907||
||=Paragraph count   =||    28,552||
||=Sentence count    =||   139,357||
||=Token count       =|| 2,531,443||
||=Ge'ez script lexicon size =||   225,132||
||=Sera transliteration lexicon size  =||   220,935||
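
The counts above imply some average structure sizes, which can be derived with a few divisions. The following is a minimal sketch: the numbers are copied from the table, while the variable names and the rounding to one decimal place are choices made for this illustration only.

```python
# Counts copied from the table of corpus structures.
documents = 1_907
paragraphs = 28_552
sentences = 139_357
tokens = 2_531_443


def ratio(numerator, denominator):
    """Average number of smaller units per larger unit, one decimal place."""
    return round(numerator / denominator, 1)


averages = {
    "paragraphs per document": ratio(paragraphs, documents),
    "sentences per paragraph": ratio(sentences, paragraphs),
    "tokens per sentence": ratio(tokens, sentences),
    "tokens per document": ratio(tokens, documents),
}

for name, value in averages.items():
    print(f"{name}: {value}")
```

These are plain averages over the whole corpus and say nothing about how sizes are distributed across individual documents.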

== Morphological annotation ==

The corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
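
As an illustration of what such an evaluation typically measures, token-level accuracy is the share of tokens whose automatic tag matches the manually assigned one. The sketch below is not the project's evaluation procedure: the function name `tagging_accuracy` and the five-token sample are hypothetical and serve only to show the calculation.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Share of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot evaluate an empty sample")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical toy sample, not taken from the corpus:
# the second token is a noun with a preposition (NP) mis-tagged as N.
gold = ["N", "NP", "V", "PUNC", "ADJ"]
predicted = ["N", "N", "V", "PUNC", "ADJ"]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
```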


=== Tag-set ===

||=Basic Class=||=Definition of the tag=||=Tag=||
||Noun||Verbal/infinitival noun, formed from any verb|| VN ||
|| ||Noun attached with a preposition|| NP ||
|| ||Noun attached with conjunction|| NC ||
|| ||Noun with a proclitic preposition and an enclitic conjunction|| NPC ||
|| ||Any other noun|| N ||
||Pronoun||Pronoun attached with preposition|| PRONP ||
|| ||Pronoun attached with conjunction|| PRONC ||
|| ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC ||
|| ||Any other Pronoun|| PRON ||
||Verb||Auxiliary verb|| AUX ||
|| ||Relative verb|| VREL ||
|| ||Verb attached with preposition|| VP ||
|| ||Verb attached with conjunction|| VC ||
|| ||Verb with a proclitic preposition and an enclitic conjunction|| VPC ||
|| ||Verb (all other)|| V ||
||Adjective||Adjective attached with preposition|| ADJP ||
|| ||Adjective attached with conjunction|| ADJC ||
|| ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC  ||
|| ||Any other Adjective|| ADJ ||
||Preposition||Preposition|| PREP ||
||Conjunction||Conjunction|| CONJ ||
||Adverb||Adverb|| ADV ||
||Numeral||Cardinal|| NUMCR ||
|| ||Ordinal|| NUMOR ||
|| ||Numeral attached with preposition|| NUMP ||
|| ||Numeral attached with conjunction|| NUMC ||
|| ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC ||
||Interjection||Interjections|| INT ||
||Punctuation||Punctuation|| PUNC ||
||Unclassified||Unclassified|| UNC  ||
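
The tag-set follows a regular pattern: for nouns, pronouns, verbs, adjectives and numerals, the suffixes `P`, `C` and `PC` mark an attached preposition, an attached conjunction, or both. The sketch below encodes the table as a lookup that returns the basic class and the two clitic flags. The names `TAGS` and `describe` are introduced here for illustration, and the tags are listed explicitly rather than parsed, because tags such as `PREP`, `PUNC` or `NUMCR` end in the same letters without carrying a clitic.

```python
# Classes whose tags take the clitic suffixes P, C and PC.
CLITIC_BASES = [
    ("N", "Noun"),
    ("PRON", "Pronoun"),
    ("V", "Verb"),
    ("ADJ", "Adjective"),
    ("NUM", "Numeral"),
]
CLITIC_SUFFIXES = [
    ("P", True, False),   # attached preposition
    ("C", False, True),   # attached conjunction
    ("PC", True, True),   # proclitic preposition and enclitic conjunction
]
# Tags without clitic marking, mapped to their basic class.
PLAIN_TAGS = {
    "VN": "Noun", "N": "Noun", "PRON": "Pronoun",
    "AUX": "Verb", "VREL": "Verb", "V": "Verb", "ADJ": "Adjective",
    "PREP": "Preposition", "CONJ": "Conjunction", "ADV": "Adverb",
    "NUMCR": "Numeral", "NUMOR": "Numeral", "INT": "Interjection",
    "PUNC": "Punctuation", "UNC": "Unclassified",
}

# tag -> (basic class, has preposition, has conjunction)
TAGS = {tag: (cls, False, False) for tag, cls in PLAIN_TAGS.items()}
for base, cls in CLITIC_BASES:
    for suffix, has_prep, has_conj in CLITIC_SUFFIXES:
        TAGS[base + suffix] = (cls, has_prep, has_conj)


def describe(tag):
    """Return (basic class, has preposition, has conjunction), or None."""
    return TAGS.get(tag)


for tag in ("NP", "PRONPC", "VREL", "PREP"):
    print(tag, describe(tag))
```

Tags that are not in the table, such as `SENT` in the frequency list below, yield `None`.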

=== Tag frequencies ===

Nouns are by far the most frequent part of speech in the corpus: the plain noun tag N alone accounts for 1,676,460 of the 2,531,443 tokens (about 66%). The 20 most frequent part-of-speech tags:
||=Part of speech tag =||=Token count =||
||N|| 1,676,460||
||PUNC|| 135,685||
||NP|| 135,676||
||SENT|| 116,574||
||V|| 106,615||
||NUMCR|| 91,516||
||VP|| 62,990||
||NC|| 60,589||
||ADJ|| 56,009||
||VN|| 21,778||
||VREL|| 16,530||
||CONJ|| 11,953||
||ADV|| 10,193||
||PREP|| 7,444||
||NPC|| 5,954||
||UNC|| 5,723||
||VPC|| 4,532||
||ADJC|| 1,455||
||PRON|| 1,219||
||ADJPC|| 1,016||

== Corpus query interface ==
The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16.

== References ==
 - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
 - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.