= D3.3b: A Tigrinya corpus, sized 3 million words, morphologically annotated =

== Building the Tigrinya Web corpus ==
The building of the corpus is described at [[TigrinyaCorpus]].

== Corpus properties ==
Basic properties of the corpus are summarised below.

The size of corpus structures:
||=Document count =|| 1,907||
||=Paragraph count =|| 28,552||
||=Sentence count =|| 139,357||
||=Token count =|| 2,531,443||
||=Ge'ez script lexicon size =|| 225,132||
||=Sera transliteration lexicon size =|| 220,935||

== Morphological annotation ==
The corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus [1]. Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.

=== Tag-set ===
||=Basic Class=||=Definition of the tag=||=Tag=||
||Noun||Verbal/infinitival noun, formed from any verb|| VN ||
|| ||Noun attached with a preposition|| NP ||
|| ||Noun attached with a conjunction|| NC ||
|| ||Noun with a proclitic preposition and an enclitic conjunction|| NPC ||
|| ||Any other noun|| N ||
||Pronoun||Pronoun attached with a preposition|| PRONP ||
|| ||Pronoun attached with a conjunction|| PRONC ||
|| ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC ||
|| ||Any other pronoun|| PRON ||
||Verb||Auxiliary verb|| AUX ||
|| ||Relative verb|| VREL ||
|| ||Verb attached with a preposition|| VP ||
|| ||Verb attached with a conjunction|| VC ||
|| ||Verb with a proclitic preposition and an enclitic conjunction|| VPC ||
|| ||Any other verb|| V ||
||Adjective||Adjective attached with a preposition|| ADJP ||
|| ||Adjective attached with a conjunction|| ADJC ||
|| ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC ||
|| ||Any other adjective|| ADJ ||
||Preposition||Preposition|| PREP ||
||Conjunction||Conjunction|| CONJ ||
||Adverb||Adverb|| ADV ||
||Numeral||Cardinal|| NUMCR ||
|| ||Ordinal|| NUMOR ||
|| ||Numeral attached with a preposition|| NUMP ||
|| ||Numeral attached with a conjunction|| NUMC ||
|| ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC ||
||Interjection||Interjection|| INT ||
||Punctuation||Punctuation|| PUNC ||
||Unclassified||Unclassified|| UNC ||

=== Tag frequencies ===
Nouns are by far the most frequent part of speech in the corpus.

The 20 most frequent part-of-speech tags:
||=Part-of-speech tag =||=Token count =||
||N|| 1,676,460||
||PUNC|| 135,685||
||NP|| 135,676||
||SENT|| 116,574||
||V|| 106,615||
||NUMCR|| 91,516||
||VP|| 62,990||
||NC|| 60,589||
||ADJ|| 56,009||
||VN|| 21,778||
||VREL|| 16,530||
||CONJ|| 11,953||
||ADV|| 10,193||
||PREP|| 7,444||
||NPC|| 5,954||
||UNC|| 5,723||
||VPC|| 4,532||
||ADJC|| 1,455||
||PRON|| 1,219||
||ADJPC|| 1,016||

== Corpus query interface ==
The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. The corpus can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16.

== References ==
 - [1] Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
 - [2] Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
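
In the tag-set above, a base class combines with the markers P (proclitic preposition), C (enclitic conjunction) or PC (both). The following is a minimal sketch of how such tags might be decomposed when post-processing the corpus. It is not part of the HaBiT tooling: the function name `split_tag`, the `COUNTS` dictionary and the base label `NUM` are illustrative assumptions, and the counts are copied from the tag frequency table. Stripping a trailing P or C naively would misread atomic tags such as PREP, PUNC or UNC, so the sketch lists the base classes explicitly.

```python
import re

# Assumption: only these five base classes take the P / C / PC markers,
# as listed in the tag-set table. "NUM" is a hypothetical label for the
# numeral base, since the table has no bare NUM tag.
CLITIC_TAG = re.compile(r"^(PRON|NUM|ADJ|N|V)(PC|P|C)$")


def split_tag(tag):
    """Return (base, has_preposition, has_conjunction) for a tag."""
    match = CLITIC_TAG.match(tag)
    if match is None:
        # Tags such as PREP, PUNC, UNC, VN or NUMCR are treated as atomic.
        return tag, False, False
    base, marker = match.groups()
    return base, "P" in marker, "C" in marker


# Token counts copied from the tag frequency table (clitic-bearing tags only).
COUNTS = {
    "NP": 135676, "VP": 62990, "NC": 60589, "NPC": 5954,
    "VPC": 4532, "ADJC": 1455, "ADJPC": 1016,
}

# Tokens carrying a proclitic preposition / an enclitic conjunction.
with_preposition = sum(n for tag, n in COUNTS.items() if split_tag(tag)[1])
with_conjunction = sum(n for tag, n in COUNTS.items() if split_tag(tag)[2])
print(with_preposition, with_conjunction)
```

Because the frequency table lists only the most frequent tags, the two totals are lower bounds for the whole corpus.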