
Version 3 (modified by hales, 9 years ago)

D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated

The building of the corpus is described at AmharicCorpus.

Basic properties of corpus sources are summarised below.

The size of corpus structures:

The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].
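As a rough illustration of how tag frequencies might be tallied from tagged output, the sketch below assumes the common one-token-per-line layout with tab-separated token, tag and lemma columns, and structural markup lines starting with "<". The function name `count_pos_tags`, the sample tokens and the tags `N` and `V` are placeholders invented for this example; they are not taken from the corpus or its tagset.

```python
from collections import Counter


def count_pos_tags(lines):
    """Count tags in tab-separated token/tag[/lemma] lines (assumed layout)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and structural markup such as <doc ...> or </s>.
        # Simplification: a real token that starts with "<" is skipped too.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no tag column on this line
        counts[fields[1]] += 1
    return counts


# Hypothetical sample data; tokens and tags are placeholders.
sample = [
    "<s>",
    "w1\tN\tl1",
    "w2\tV\tl2",
    "w3\tN\tl3",
    "</s>",
]
tag_counts = count_pos_tags(sample)
print(tag_counts.most_common(2))
```

For a real file, the same function could be fed an open file handle instead of the sample list, since it only iterates over lines.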

The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags:

The corpus has been indexed by the corpus manager and query system Sketch Engine [2]. It can be searched at http://corpora.fi.muni.cz/habit/.

[1] Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
[2] Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
