
D3.4b: A Somali corpus, sized 10 million words, morphologically annotated

The building of the corpus is described at SomaliCorpus.

Basic properties of corpus sources are summarised below.

The size of corpus structures:

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
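The general idea of a tagger that combines a lexicon with regular expressions can be sketched as follows. This is a minimal illustration, not the tagger used in the project: the lexicon entries, the regular expressions, the fallback tag and the names `LEXICON`, `RULES`, `tag_token` and `tag_tokens` are all assumptions made for the example.

```python
import re

# Hypothetical lexicon of frequent word forms mapped to UD POS tags.
# The entries are illustrative only and are not taken from the project's tagger.
LEXICON = {
    "iyo": "CONJ",  # CONJ is the UD v1 name (CCONJ in UD v2)
    "waa": "PART",
}

# Hypothetical regular-expression rules, tried in order after the lexicon lookup.
RULES = [
    (re.compile(r"\d+(?:[.,]\d+)*"), "NUM"),  # digits, optionally with separators
    (re.compile(r"[^\w\s]+"), "PUNCT"),       # runs of punctuation characters
]

# Assumed tag for tokens matched by neither the lexicon nor any rule.
FALLBACK_TAG = "X"


def tag_token(token):
    """Return a POS tag for one token: lexicon first, then regex rules."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    for pattern, tag in RULES:
        if pattern.fullmatch(token):
            return tag
    return FALLBACK_TAG


def tag_tokens(tokens):
    """Tag a sequence of tokens, returning (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]
```

Looking up the lexicon before the rules lets frequent, irregular word forms override any pattern that would otherwise match them.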

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.

The tag-set is based on the POS tags of Universal Dependencies [2].

The most frequent part of speech in the corpus is the noun. The most frequent part-of-speech tags:
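A tag frequency list of this kind can be computed from a tagged corpus with a short script. The sketch below assumes a vertical file with one token per line, tab-separated attributes with the tag in the second column, and structure lines (such as document boundaries) that contain no tab. The column layout and the name `count_tags` are assumptions, not a description of the project's files.

```python
from collections import Counter


def count_tags(lines, tag_column=1):
    """Count POS tags in vertical-format lines (token<TAB>tag...).

    Lines without a tab (blank lines, structure markup) are skipped.
    The tag column index is an assumption about the file layout.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # blank line or structural markup
        fields = line.split("\t")
        if len(fields) > tag_column:
            counts[fields[tag_column]] += 1
    return counts
```

Calling `count_tags(open(path, encoding="utf-8")).most_common(10)` on such a file would give the ten most frequent tags with their counts.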

The corpus has been indexed by the corpus manager and query system Sketch Engine [1]. The corpus can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.

[1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
[2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.

Last modified on Jan 19, 2017, 2:36:47 AM
