D3.3b: A Tigrinya corpus, sized 3 million words, morphologically annotated

Building the Tigrinya Web corpus

The building of the corpus is described at TigrinyaCorpus.

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 1,907
Paragraph count 28,552
Sentence count 139,357
Token count 2,531,443
Ge'ez script lexicon size 225,132
SERA transliteration lexicon size 220,935
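
The counts above also give a rough picture of the average document and sentence length. The following is a minimal sketch of that arithmetic; the dictionary keys and the helper name `ratio` are illustrative only, and the figures are copied from the table above.

```python
# Structure counts copied from the corpus properties table.
CORPUS_SIZES = {
    "documents": 1_907,
    "paragraphs": 28_552,
    "sentences": 139_357,
    "tokens": 2_531_443,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return CORPUS_SIZES[numerator] / CORPUS_SIZES[denominator]


if __name__ == "__main__":
    print(f"tokens per sentence:     {ratio('tokens', 'sentences'):.1f}")
    print(f"sentences per paragraph: {ratio('sentences', 'paragraphs'):.1f}")
    print(f"paragraphs per document: {ratio('paragraphs', 'documents'):.1f}")
    print(f"tokens per document:     {ratio('tokens', 'documents'):.1f}")
```

On these figures an average sentence has about 18 tokens and an average document about 1,327 tokens.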

Morphological annotation

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
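
To work with the tagged text outside the query interface, the tagger output has to be read back in. The sketch below assumes the usual TreeTagger layout of one token per line with tab-separated columns (word form, tag, optionally a lemma) and structural markup such as <s> on lines of its own. The function names and the sample tokens are made up for illustration and are not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_tagged(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from tab-separated tagger output.

    Lines without a tab (blank lines, structural markup) are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        yield fields[0], fields[1]


def tag_counts(lines: Iterable[str]) -> Counter:
    """Count how often each tag occurs in the tagger output."""
    return Counter(tag for _, tag in read_tagged(lines))


# Hypothetical sample: placeholder tokens, not real corpus data.
SAMPLE = ["<s>", "tok1\tN", "tok2\tVP", "tok3\tN", ".\tPUNC", "</s>"]

if __name__ == "__main__":
    for tag, count in tag_counts(SAMPLE).most_common():
        print(tag, count)
```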

Tag-set

Basic class   Definition of the tag                                               Tag
Noun          Verbal/infinitival noun, formed from any verb                       VN
              Noun with an attached preposition                                   NP
              Noun with an attached conjunction                                   NC
              Noun with a proclitic preposition and an enclitic conjunction       NPC
              Any other noun                                                      N
Pronoun       Pronoun with an attached preposition                                PRONP
              Pronoun with an attached conjunction                                PRONC
              Pronoun with a proclitic preposition and an enclitic conjunction    PRONPC
              Any other pronoun                                                   PRON
Verb          Auxiliary verb                                                      AUX
              Relative verb                                                       VREL
              Verb with an attached preposition                                   VP
              Verb with an attached conjunction                                   VC
              Verb with a proclitic preposition and an enclitic conjunction       VPC
              Any other verb                                                      V
Adjective     Adjective with an attached preposition                              ADJP
              Adjective with an attached conjunction                              ADJC
              Adjective with a proclitic preposition and an enclitic conjunction  ADJPC
              Any other adjective                                                 ADJ
Preposition   Preposition                                                         PREP
Conjunction   Conjunction                                                         CONJ
Adverb        Adverb                                                              ADV
Numeral       Cardinal                                                            NUMCR
              Ordinal                                                             NUMOR
              Numeral with an attached preposition                                NUMP
              Numeral with an attached conjunction                                NUMC
              Numeral with a proclitic preposition and an enclitic conjunction    NUMPC
Interjection  Interjection                                                        INT
Punctuation   Punctuation                                                         PUNC
Unclassified  Unclassified                                                        UNC
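
The tag names in the table above follow a regular pattern: for nouns, pronouns, verbs, adjectives and numerals, a trailing P marks an attached preposition, C an attached conjunction, and PC both. The sketch below encodes that reading of the table as a lookup structure. The names `TagInfo`, `TAGSET` and `describe` are illustrative and not part of the corpus tooling. Tags that are not in the table, such as SENT in the frequency list, are returned as `None`.

```python
from typing import Dict, NamedTuple, Optional


class TagInfo(NamedTuple):
    basic_class: str
    has_preposition: bool
    has_conjunction: bool


# Bases that combine with the P / C / PC suffixes in the tag-set table.
CLITIC_BASES = {
    "N": "Noun",
    "PRON": "Pronoun",
    "V": "Verb",
    "ADJ": "Adjective",
    "NUM": "Numeral",
}

# Tags without clitic marking, mapped to their basic class.
PLAIN_TAGS = {
    "VN": "Noun", "N": "Noun",
    "PRON": "Pronoun",
    "AUX": "Verb", "VREL": "Verb", "V": "Verb",
    "ADJ": "Adjective",
    "PREP": "Preposition",
    "CONJ": "Conjunction",
    "ADV": "Adverb",
    "NUMCR": "Numeral", "NUMOR": "Numeral",
    "INT": "Interjection",
    "PUNC": "Punctuation",
    "UNC": "Unclassified",
}


def build_tagset() -> Dict[str, TagInfo]:
    """Build the full tag -> TagInfo mapping from the two tables above."""
    tagset = {tag: TagInfo(cls, False, False) for tag, cls in PLAIN_TAGS.items()}
    for base, cls in CLITIC_BASES.items():
        tagset[base + "P"] = TagInfo(cls, True, False)
        tagset[base + "C"] = TagInfo(cls, False, True)
        tagset[base + "PC"] = TagInfo(cls, True, True)
    return tagset


TAGSET = build_tagset()


def describe(tag: str) -> Optional[TagInfo]:
    """Return the TagInfo for a tag, or None if it is not in the tag-set."""
    return TAGSET.get(tag)


if __name__ == "__main__":
    for tag in ("NPC", "VREL", "NUMC", "SENT"):
        print(tag, describe(tag))
```

An explicit dictionary is used instead of suffix stripping because several plain tags also end in P or C (PREP, UNC, PUNC), so a purely suffix-based rule would misclassify them.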

Tag frequencies

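Nouns are by far the most frequent part of speech in the corpus: the plain N tag alone accounts for about 66% of all tokens (1,676,460 of 2,531,443). SENT is not listed in the tag-set table above; it is presumably the sentence-end tag that TreeTagger uses by default. The 20 most frequent part-of-speech tags: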

Part of speech tag Token count
N 1,676,460
PUNC 135,685
NP 135,676
SENT 116,574
V 106,615
NUMCR 91,516
VP 62,990
NC 60,589
ADJ 56,009
VN 21,778
VREL 16,530
CONJ 11,953
ADV 10,193
PREP 7,444
NPC 5,954
UNC 5,723
VPC 4,532
ADJC 1,455
PRON 1,219
ADJPC 1,016
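
Relative frequencies are often easier to compare than raw counts. The following sketch converts the counts listed above into shares of the corpus. Using the token count of 2,531,443 from the corpus properties as the denominator is an assumption, as are the helper names.

```python
# Token count from the corpus properties table (assumed denominator).
TOTAL_TOKENS = 2_531_443

# Counts copied from the tag frequency list above.
TAG_COUNTS = {
    "N": 1_676_460, "PUNC": 135_685, "NP": 135_676, "SENT": 116_574,
    "V": 106_615, "NUMCR": 91_516, "VP": 62_990, "NC": 60_589,
    "ADJ": 56_009, "VN": 21_778, "VREL": 16_530, "CONJ": 11_953,
    "ADV": 10_193, "PREP": 7_444, "NPC": 5_954, "UNC": 5_723,
    "VPC": 4_532, "ADJC": 1_455, "PRON": 1_219, "ADJPC": 1_016,
}

# Noun tags according to the tag-set table.
NOUN_TAGS = ("N", "NP", "NC", "NPC", "VN")


def share(tag: str) -> float:
    """Fraction of all corpus tokens that carry the given tag."""
    return TAG_COUNTS[tag] / TOTAL_TOKENS


def combined_share(tags) -> float:
    """Fraction of all corpus tokens that carry any of the given tags."""
    return sum(TAG_COUNTS[tag] for tag in tags) / TOTAL_TOKENS


if __name__ == "__main__":
    for tag in TAG_COUNTS:
        print(f"{tag:<6} {share(tag):7.2%}")
    print(f"all noun tags: {combined_share(NOUN_TAGS):.2%}")
    print(f"tokens with tags not listed: {TOTAL_TOKENS - sum(TAG_COUNTS.values()):,}")
```

The listed tags cover all but 1,532 of the tokens, so the remaining tags of the tag-set are very rare in the corpus.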

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16.

References

  • [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue, 19th International Conference, TSD 2016, Brno.
  • [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
Last modified on Jan 19, 2017, 2:12:27 AM