
D3.4b: A Somali corpus, sized 10 million words, morphologically annotated

Building the Somali Web corpus

The building of the corpus is described at SomaliCorpus.

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 385,338
Paragraph count 1,937,758
Sentence count 2,643,336
Token count 79,741,231
Lexicon size 1,399,350
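
The counts above can be turned into a few derived averages. The following is a minimal sketch in Python; the dictionary keys and the `ratio` helper are names chosen for this illustration and are not part of the corpus tooling.

```python
# Corpus structure sizes, copied from the table above.
CORPUS = {
    "documents": 385_338,
    "paragraphs": 1_937_758,
    "sentences": 2_643_336,
    "tokens": 79_741_231,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return CORPUS[numerator] / CORPUS[denominator]


if __name__ == "__main__":
    print(f"tokens per sentence:     {ratio('tokens', 'sentences'):.1f}")
    print(f"tokens per document:     {ratio('tokens', 'documents'):.1f}")
    print(f"sentences per paragraph: {ratio('sentences', 'paragraphs'):.2f}")
    print(f"paragraphs per document: {ratio('paragraphs', 'documents'):.2f}")
```

These are plain averages over the whole corpus and say nothing about how the sizes are distributed across documents.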

Morphological annotation

The corpus was tagged by a simple tagger based on regular expressions and a lexicon of the most frequent words.
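
To illustrate the general idea of a tagger that combines a lexicon with regular expressions, here is a minimal sketch in Python. It is not the tagger used for this corpus: the lexicon entries, the regular expressions, the rule order and the `NOUN` fallback are all assumptions made for this example.

```python
import re

# Hypothetical mini-lexicon (word form -> tag); entries are illustrative only.
LEXICON = {
    "iyo": "CONJ",
    "oo": "CONJ",
    "ku": "ADP",
    "ka": "ADP",
}

# Hypothetical ordered rules; the first matching pattern wins.
RULES = [
    (re.compile(r"^\d+([.,]\d+)*$"), "NUM"),
    (re.compile(r"^[.,;:!?()\[\]'\"-]+$"), "PUNCT"),
    (re.compile(r"^[%$+=<>@#&*/]+$"), "SYM"),
]

# Assumption: tokens matched by neither the lexicon nor a rule default to NOUN.
DEFAULT_TAG = "NOUN"


def tag_token(token: str) -> str:
    """Look the token up in the lexicon, then try the regex rules in order."""
    tag = LEXICON.get(token.lower())
    if tag is not None:
        return tag
    for pattern, rule_tag in RULES:
        if pattern.match(token):
            return rule_tag
    return DEFAULT_TAG


def tag_sentence(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag each token independently; no sentence context is used."""
    return [(token, tag_token(token)) for token in tokens]


if __name__ == "__main__":
    for token, tag in tag_sentence(["Cali", "iyo", "Xasan", ",", "2016"]):
        print(f"{token}\t{tag}")
```

Checking the lexicon before the rules lets frequent closed-class words take precedence over generic patterns. A tagger of this kind cannot disambiguate by context, since each token is tagged on its own.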

Manual evaluation of the tagging accuracy will be done in the next phase of the HaBiT project.

Tag-set

The tag-set is based on the POS tags of Universal Dependencies [2].

Tag Description
ADJ adjective
ADP adposition
ADV adverb
AUX auxiliary
CONJ conjunction
DET determiner
NOUN noun
NUM numeral
PRON pronoun
PUNCT punctuation
SYM symbol
VERB verb

Tag frequencies

Nouns are by far the most frequent part of speech in the corpus. Token counts for the most frequent part-of-speech tags:

Part of speech tag Token count
NOUN 42,711,639
PRON 8,458,321
PUNCT 6,527,366
PREP 4,796,940
CONJ 4,147,836
ADP 3,352,316
DET 2,605,149
VERB 2,483,263
ADJ 1,466,010
NUM 1,357,567
ADV 1,139,631
SYM 616,690
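
The counts above can be related to the total token count of 79,741,231 given under corpus properties. The following is a minimal sketch in Python; the variable and function names are chosen for this illustration, and the tag labels are copied from the table as they appear there.

```python
# Total token count, from the corpus properties section.
TOKEN_COUNT = 79_741_231

# Per-tag token counts, copied from the table above (labels as listed there).
TAG_COUNTS = {
    "NOUN": 42_711_639,
    "PRON": 8_458_321,
    "PUNCT": 6_527_366,
    "PREP": 4_796_940,
    "CONJ": 4_147_836,
    "ADP": 3_352_316,
    "DET": 2_605_149,
    "VERB": 2_483_263,
    "ADJ": 1_466_010,
    "NUM": 1_357_567,
    "ADV": 1_139_631,
    "SYM": 616_690,
}


def tag_shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Share of all corpus tokens carried by each tag."""
    return {tag: count / total for tag, count in counts.items()}


if __name__ == "__main__":
    shares = tag_shares(TAG_COUNTS, TOKEN_COUNT)
    for tag, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{tag:<6}{share:8.2%}")

    listed = sum(TAG_COUNTS.values())
    print(f"listed tags cover {listed:,} of {TOKEN_COUNT:,} tokens")
```

On these figures NOUN accounts for roughly 53.6% of all tokens, and the twelve listed tags together cover 79,662,728 tokens, leaving 78,503 tokens outside the table.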

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16.

References

  • [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
  • [2] -- Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). 2016.
Last modified on Jan 19, 2017, 2:36:47 AM