D3.3b: A Tigrinya corpus, sized 3 million words, morphologically annotated

The building of the corpus is described at TigrinyaCorpus.

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count	1,907
Paragraph count	28,552
Sentence count	139,357
Token count	2,531,443
Ge'ez script lexicon size	225,132
Sera transliteration lexicon size	220,935

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.
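Until the manual evaluation is available, the tag distribution of the tagged text can be checked with a small script. The sketch below assumes tab-separated lines with the token in the first column and the tag in the second (as TreeTagger prints when run with its `-token` option) and structure markup lines without tabs; the function name `tag_frequencies` and the sample tokens are hypothetical placeholders, not corpus data.

```python
from collections import Counter


def tag_frequencies(lines):
    """Count tags in tab-separated token/tag lines (assumed layout)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Blank lines and structure markup such as <s> carry no tag.
            continue
        counts[fields[1]] += 1
    return counts


# Hypothetical sample: placeholder tokens, tags taken from the tag set.
sample = ["<s>", "tok1\tN", "tok2\tVREL", "tok3\tN", "tok4\tPUNC", "</s>"]
freqs = tag_frequencies(sample)
for tag, count in freqs.most_common():
    print(tag, count)
```

For a real file, the same function can be fed an open file object, since it only iterates over lines.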

Basic Class	Definition of the tag	Tag
Noun	Verbal/infinitival noun, formed from any verb	VN
	Noun attached with a preposition	NP
	Noun attached with conjunction	NC
	Noun with a proclitic preposition and an enclitic conjunction	NPC
	Any other noun	N
Pronoun	Pronoun attached with preposition	PRONP
	Pronoun attached with conjunction	PRONC
	Pronoun with a proclitic preposition and an enclitic conjunction	PRONPC
	Any other Pronoun	PRON
Verb	Auxiliary verb	AUX
	Relative verb	VREL
	Verb attached with preposition	VP
	Verb attached with conjunction	VC
	Verb with a proclitic preposition and an enclitic conjunction	VPC
	Verb (all other)	V
Adjective	Adjective attached with preposition	ADJP
	Adjective attached with conjunction	ADJC
	Adjective with a proclitic preposition and an enclitic conjunction	ADJPC
	Any other Adjective	ADJ
Preposition	Preposition	PREP
Conjunction	Conjunction	CONJ
Adverb	Adverb	ADV
Numeral	Cardinal	NUMCR
	Ordinal	NUMOR
	Numeral attached with preposition	NUMP
	Numeral attached with conjunction	NUMC
	Numeral with a proclitic preposition and an enclitic conjunction	NUMPC
Interjection	Interjections	INT
Punctuation	Punctuation	PUNC
Unclassified	Unclassified	UNC
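The tag set in the table is regular: for nouns, pronouns, verbs, adjectives and numerals, a suffix `P`, `C` or `PC` marks an attached preposition, conjunction, or both. The following sketch encodes that reading of the table as a lookup; the names `TagInfo`, `PLAIN_TAGS`, `CLITIC_BASES` and `TAGSET` are illustrative assumptions and not part of the corpus tooling. An explicit lookup is used rather than suffix stripping because tags such as `PUNC`, `UNC` and `PREP` also end in `C` or `P` without carrying a clitic.

```python
from typing import NamedTuple


class TagInfo(NamedTuple):
    basic_class: str
    preposition: bool  # proclitic preposition attached
    conjunction: bool  # enclitic conjunction attached


# Tags without a clitic marker, mapped to their basic class.
PLAIN_TAGS = {
    "VN": "Noun", "N": "Noun", "PRON": "Pronoun",
    "AUX": "Verb", "VREL": "Verb", "V": "Verb",
    "ADJ": "Adjective", "PREP": "Preposition", "CONJ": "Conjunction",
    "ADV": "Adverb", "NUMCR": "Numeral", "NUMOR": "Numeral",
    "INT": "Interjection", "PUNC": "Punctuation", "UNC": "Unclassified",
}

# Tag prefixes that combine with the P / C / PC suffixes.
CLITIC_BASES = {
    "N": "Noun", "PRON": "Pronoun", "V": "Verb",
    "ADJ": "Adjective", "NUM": "Numeral",
}


def build_tagset():
    """Return a dict from every tag in the table to its TagInfo."""
    tagset = {tag: TagInfo(cls, False, False) for tag, cls in PLAIN_TAGS.items()}
    for base, cls in CLITIC_BASES.items():
        tagset[base + "P"] = TagInfo(cls, True, False)
        tagset[base + "C"] = TagInfo(cls, False, True)
        tagset[base + "PC"] = TagInfo(cls, True, True)
    return tagset


TAGSET = build_tagset()
print(TAGSET["NPC"])
```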

Nouns are the most frequent part of speech in the corpus. The most frequent part-of-speech tags:

The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16.

[1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
[2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.

Last modified on Jan 19, 2017, 2:12:27 AM
