Changes between Version 1 and Version 2 of AmharicCorpusMorpho


Timestamp:
Jan 17, 2017, 9:56:11 AM (7 years ago)
Author:
pary
Comment:

--

  • AmharicCorpusMorpho

    v1 v2  
    11= D3.1b: An Amharic corpus, sized 20 million words, morphologically annotated =
     2
     3== Building the Amharic Web corpus ==
     4
     5The building of the corpus is described at [[AmharicCorpus]].
     6
     7
     8== Corpus properties ==
     9Basic properties of corpus sources are summarised below.
     10
     11The size of corpus structures:
     12||=Document count    =||     33,542||
     13||=Paragraph count   =||    341,327||
     14||=Sentence count    =||  1,208,926||
     15||=Token count       =|| 20,287,250||
     16||=Ge'ez lexicon size=||    955,628||
     17||=Sera lexicon size =||    948,553||
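
As a quick sanity check on the figures above, the structure counts can be turned into average sizes. The following is a minimal sketch; the dictionary keys and the `ratio` helper are illustrative names, not part of the corpus tooling.

```python
# Counts copied from the table of corpus structures.
counts = {
    "documents": 33_542,
    "paragraphs": 341_327,
    "sentences": 1_208_926,
    "tokens": 20_287_250,
}


def ratio(numerator, denominator):
    """Return the average number of `numerator` units per `denominator` unit."""
    return counts[numerator] / counts[denominator]


tokens_per_sentence = ratio("tokens", "sentences")
sentences_per_paragraph = ratio("sentences", "paragraphs")
paragraphs_per_document = ratio("paragraphs", "documents")
tokens_per_document = ratio("tokens", "documents")

for name, value in [
    ("tokens per sentence", tokens_per_sentence),
    ("sentences per paragraph", sentences_per_paragraph),
    ("paragraphs per document", paragraphs_per_document),
    ("tokens per document", tokens_per_document),
]:
    print(f"{name}: {value:.1f}")
```

With these counts, an average sentence has roughly 17 tokens and an average document roughly 600 tokens.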
     18
     19== Morphological annotation ==
     20
     21The corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus [1].
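
Tag frequencies like those reported in this section can be derived from tagger output with a few lines of code. The sketch below assumes one token per line with the token and its tag separated by a tab, and structural markup on lines starting with `<`; the exact output format used for this corpus is not stated here, and the sample lines are made-up placeholders rather than real tagger output.

```python
from collections import Counter


def count_tags(lines):
    """Count part-of-speech tags in tab-separated, one-token-per-line text."""
    tags = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and structural markup such as <doc> or <s> (assumed).
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no tag column on this line
        tags[fields[1]] += 1
    return tags


# Placeholder tokens with tags taken from the tagset used in this corpus.
sample = [
    "<s>\n",
    "w1\tN\n",
    "w2\tN\n",
    "w3\tV\n",
    "w4\tSENT\n",
    "</s>\n",
]
tag_counts = count_tags(sample)
for tag, count in tag_counts.most_common():
    print(tag, count)
```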
     22
     23
     24The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part-of-speech tags in this corpus are:
     25||=Part of speech tag =||=Token count =||
     26||N     || 7,386,470||
     27||NP    || 2,660,200||
     28||VP    || 1,601,728||
     29||V     || 1,331,531||
     30||SENT  ||   946,905||
     31||VREL  ||   920,223||
     32||PUNC  ||   741,439||
     33||PREP  ||   729,404||
     34||NUMCR ||   687,686||
     35||ADJ   ||   647,608||
     36||PRON  ||   391,243||
     37||VN    ||   389,152||
     38||AUX   ||   373,346||
     39||NC    ||   322,592||
     40||CONJ  ||   292,046||
     41||PRONP ||   204,243||
     42||ADV   ||   173,772||
     43||NPC   ||   140,109||
     44||ADJP  ||   126,138||
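
To put the tag counts in proportion, each one can be divided by the total token count given under corpus properties. This is a small illustrative calculation; the variable names are assumptions, and the numbers are copied from the two tables on this page.

```python
TOTAL_TOKENS = 20_287_250  # token count from the corpus properties table

# Token counts per tag, copied from the table of most frequent tags.
tag_counts = {
    "N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531,
    "SENT": 946_905, "VREL": 920_223, "PUNC": 741_439, "PREP": 729_404,
    "NUMCR": 687_686, "ADJ": 647_608, "PRON": 391_243, "VN": 389_152,
    "AUX": 373_346, "NC": 322_592, "CONJ": 292_046, "PRONP": 204_243,
    "ADV": 173_772, "NPC": 140_109, "ADJP": 126_138,
}

# Share of all tokens carried by each listed tag.
shares = {tag: count / TOTAL_TOKENS for tag, count in tag_counts.items()}

# Fraction of the corpus covered by the listed tags taken together.
coverage = sum(tag_counts.values()) / TOTAL_TOKENS

for tag, share in shares.items():
    print(f"{tag:6s} {share:6.1%}")
print(f"listed tags cover {coverage:.1%} of all tokens")
```

The tag `N` alone accounts for a little over a third of all tokens, and the listed tags together cover nearly the whole corpus.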
     45
     46== Corpus query interface ==
     47The corpus has been indexed by the Sketch Engine corpus manager and query system [2]. It can be searched at http://corpora.fi.muni.cz/habit/.
     48
     49== References ==
     50 - [1] -- Rychlý, Pavel, and Vít Suchomel. "Annotated Amharic Corpora." In Proceedings of Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno.
     51 - [2] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.