Changes between Version 3 and Version 4 of AmharicCorpus


Timestamp:
Jan 16, 2017, 7:02:13 PM (7 years ago)
Author:
xsuchom2
Comment:

table alignment

  • AmharicCorpus

    v3 v4  
    11= D3.1a: An Amharic corpus, sized 20 million words =
    22
    3 == Building the Amharic web corpus ==
     3== Building the Amharic Web corpus ==
    44We have used the following steps to create a big Web corpus: First, adopting the Corpus factory method [1], bigrams of Amharic words from the Crúbadán database [2] were used to query the Bing search engine for documents in Amharic. The 354 queries yielded 6,453 URLs. The URLs of 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
    55
     
    1717
    1818The size of corpus structures:
    19 ||=Document count    =||     33,542 ||
    20 ||=Paragraph count   =||    341,327 ||
    21 ||=Sentence count    =||  1,208,926 ||
    22 ||=Token count       =|| 20,287,250 ||
    23 ||=Ge'ez lexicon size=||    955,628 ||
    24 ||=Sera lexicon size =||    948,553 ||
     19||=Document count    =||     33,542||
     20||=Paragraph count   =||    341,327||
     21||=Sentence count    =||  1,208,926||
     22||=Token count       =|| 20,287,250||
     23||=Ge'ez lexicon size=||    955,628||
     24||=Sera lexicon size =||    948,553||
    2525
    2626Document count – the most frequent web domains and domain size distribution:
    2727||||= Top level domains =||||= Web domains =||||= Domain size distribution =||
    28 ||org  || 14,582  || *.jw.org       || 6,717  || At least 1000 documents  ||   7 ||
    29 ||com  || 11,927  || *.gov.et       || 4,599  || At least 500 documents   ||  15 ||
    30 ||et   ||  5,090  || waltainfo.com  || 2,818  || At least 100 documents   ||  42 ||
    31 ||net  ||  1,084  || ginbot7.org    || 2,666  || At least 50 documents    ||  63 ||
    32 ||cz   ||    724  || eotcmk.org     || 1,141  || At least 10 documents    || 149 ||
    33 ||info ||     85  || ethsat.com     ||   894  || At least 1 document      || 573 ||
     28||org  || 14,582|| *.jw.org       || 6,717|| At least 1000 documents  ||   7||
     29||com  || 11,927|| *.gov.et       || 4,599|| At least 500 documents   ||  15||
     30||et   ||  5,090|| waltainfo.com  || 2,818|| At least 100 documents   ||  42||
     31||net  ||  1,084|| ginbot7.org    || 2,666|| At least 50 documents    ||  63||
     32||cz   ||    724|| eotcmk.org     || 1,141|| At least 10 documents    || 149||
     33||info ||     85|| ethsat.com     ||   894|| At least 1 document      || 573||
    3434
    3535We observe that the content of news/political and religious portals has a significant presence in the corpus sources. Since there are only 149 domains with at least 10 documents represented in the corpus, the resulting collection would benefit from a greater variety of sources.
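
A "domain size distribution" like the one in the table above can be derived from the list of document URLs in a corpus. The following is only a minimal sketch, not the procedure actually used for this corpus: the function name `domain_size_distribution`, the default thresholds, and the example URLs are assumptions made for illustration, and documents are grouped by full hostname rather than by wildcard patterns such as `*.jw.org`.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 1)):
    """Count how many web domains contribute at least N documents.

    `urls` is an iterable of absolute document URLs (scheme included).
    Returns a dict mapping each threshold N to the number of hostnames
    that have at least N documents.
    """
    # One document per URL; group documents by hostname.
    docs_per_domain = Counter(urlsplit(url).hostname for url in urls)
    return {
        n: sum(1 for count in docs_per_domain.values() if count >= n)
        for n in thresholds
    }


# Hypothetical input: three documents from two made-up domains.
example_urls = [
    "http://a.example/doc1",
    "http://a.example/doc2",
    "http://b.example/doc1",
]
example_distribution = domain_size_distribution(example_urls, thresholds=(2, 1))
```

Because the counts are cumulative ("at least N"), a domain with 1,200 documents is counted in every row, which is why the numbers in the table grow as the threshold falls.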
     
    3737The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part of speech tags:
    3838||=Part of speech tag =||=Token count =||
    39 ||N     || 7,386,470 ||
    40 ||NP    || 2,660,200 ||
    41 ||VP    || 1,601,728 ||
    42 ||V     || 1,331,531 ||
    43 ||SENT  ||   946,905 ||
    44 ||VREL  ||   920,223 ||
    45 ||PUNC  ||   741,439 ||
    46 ||PREP  ||   729,404 ||
    47 ||NUMCR ||   687,686 ||
    48 ||ADJ   ||   647,608 ||
    49 ||PRON  ||   391,243 ||
    50 ||VN    ||   389,152 ||
    51 ||AUX   ||   373,346 ||
    52 ||NC    ||   322,592 ||
    53 ||CONJ  ||   292,046 ||
    54 ||PRONP ||   204,243 ||
    55 ||ADV   ||   173,772 ||
    56 ||NPC   ||   140,109 ||
    57 ||ADJP  ||   126,138 ||
     39||N     || 7,386,470||
     40||NP    || 2,660,200||
     41||VP    || 1,601,728||
     42||V     || 1,331,531||
     43||SENT  ||   946,905||
     44||VREL  ||   920,223||
     45||PUNC  ||   741,439||
     46||PREP  ||   729,404||
     47||NUMCR ||   687,686||
     48||ADJ   ||   647,608||
     49||PRON  ||   391,243||
     50||VN    ||   389,152||
     51||AUX   ||   373,346||
     52||NC    ||   322,592||
     53||CONJ  ||   292,046||
     54||PRONP ||   204,243||
     55||ADV   ||   173,772||
     56||NPC   ||   140,109||
     57||ADJP  ||   126,138||
    5858
    5959== References ==