Changes between Version 3 and Version 4 of AmharicCorpus

Timestamp: Jan 16, 2017, 7:02:13 PM
Author: xsuchom2
Comment: table alignment

AmharicCorpus

-              v3
+              v4
 = D3.1a: An Amharic corpus, sized 20 million words =
-== Building the Amharic web corpus ==
+== Building the Amharic Web corpus ==
 We used the following steps to create a large Web corpus. First, following the Corpus Factory method [1], bigrams of Amharic words from the Crúbadán database [2] were used to query the Bing search engine for documents in Amharic: 354 queries yielded 6,453 URLs. The URLs of the 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
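The last step above, combining the freshly downloaded URLs with the URLs from the 2013 crawl into one set of starting points, can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `merge_seed_urls` and the example URLs are hypothetical, and the sketch assumes that exact string comparison is enough to detect duplicates.

```python
def merge_seed_urls(*url_lists):
    """Merge several URL lists into one seed list without duplicates.

    Order of first appearance is preserved; surrounding whitespace and
    empty entries are dropped. Hypothetical helper for illustration only.
    """
    seen = set()
    seeds = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds


# Hypothetical inputs: URLs found through search queries and URLs from an older crawl.
from_search = ["http://a.example/", "http://b.example/"]
from_2013 = ["http://b.example/", " http://c.example/ "]
seeds = merge_seed_urls(from_search, from_2013)
```

The resulting list could then be written one URL per line to whatever seed file the crawler expects.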
 …
 The size of corpus structures:
-||=Document count    =||     33,542 ||
-||=Paragraph count   =||    341,327 ||
-||=Sentence count    =||  1,208,926 ||
-||=Token count       =|| 20,287,250 ||
-||=Ge'ez lexicon size=||    955,628 ||
-||=Sera lexicon size =||    948,553 ||
+||=Document count    =||     33,542||
+||=Paragraph count   =||    341,327||
+||=Sentence count    =||  1,208,926||
+||=Token count       =|| 20,287,250||
+||=Ge'ez lexicon size=||    955,628||
+||=Sera lexicon size =||    948,553||
 Document count – the most frequent web domains and domain size distribution:
 ||||= Top level domains =||||= Web domains =||||= Domain size distribution =||
-||org  || 14,582  || *.jw.org       || 6,717  || At least 1000 documents  ||   7 ||
-||com  || 11,927  || *.gov.et       || 4,599  || At least 500 documents   ||  15 ||
-||et   ||  5,090  || waltainfo.com  || 2,818  || At least 100 documents   ||  42 ||
-||net  ||  1,084  || ginbot7.org    || 2,666  || At least 50 documents    ||  63 ||
-||cz   ||    724  || eotcmk.org     || 1,141  || At least 10 documents    || 149 ||
-||info ||     85  || ethsat.com     ||   894  || At least 1 document      || 573 ||
+||org  || 14,582|| *.jw.org       || 6,717|| At least 1000 documents  ||   7||
+||com  || 11,927|| *.gov.et       || 4,599|| At least 500 documents   ||  15||
+||et   ||  5,090|| waltainfo.com  || 2,818|| At least 100 documents   ||  42||
+||net  ||  1,084|| ginbot7.org    || 2,666|| At least 50 documents    ||  63||
+||cz   ||    724|| eotcmk.org     || 1,141|| At least 10 documents    || 149||
+||info ||     85|| ethsat.com     ||   894|| At least 1 document      || 573||
 We observe that content from news/politics and religious portals has a significant presence in the corpus sources. Since only 149 domains are represented by at least 10 documents, the resulting collection would benefit from a greater variety of sources.
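A domain size distribution like the one in the table above can be derived from the list of document URLs. The sketch below is only an illustration under stated assumptions: the function name `domain_size_distribution` is hypothetical, and each distinct hostname is treated as one web domain, whereas the table groups subdomains under entries such as `*.jw.org`.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 1)):
    """Count how many hostnames contribute at least N documents, for each N.

    Assumption: one URL corresponds to one document, and the hostname
    stands in for the web domain (subdomains are not merged).
    """
    docs_per_domain = Counter(urlsplit(url).hostname for url in urls)
    return {
        threshold: sum(1 for n in docs_per_domain.values() if n >= threshold)
        for threshold in thresholds
    }


# Hypothetical example: two documents from one host, one from another.
example_urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
]
distribution = domain_size_distribution(example_urls, thresholds=(2, 1))
```

The counts are cumulative, so every domain counted under a higher threshold is also counted under each lower one, as in the table.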
 …
 The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part of speech tags:
 ||=Part of speech tag =||=Token count =||
-||N     || 7,386,470 ||
-||NP    || 2,660,200 ||
-||VP    || 1,601,728 ||
-||V     || 1,331,531 ||
-||SENT  ||   946,905 ||
-||VREL  ||   920,223 ||
-||PUNC  ||   741,439 ||
-||PREP  ||   729,404 ||
-||NUMCR ||   687,686 ||
-||ADJ   ||   647,608 ||
-||PRON  ||   391,243 ||
-||VN    ||   389,152 ||
-||AUX   ||   373,346 ||
-||NC    ||   322,592 ||
-||CONJ  ||   292,046 ||
-||PRONP ||   204,243 ||
-||ADV   ||   173,772 ||
-||NPC   ||   140,109 ||
-||ADJP  ||   126,138 ||
+||N     || 7,386,470||
+||NP    || 2,660,200||
+||VP    || 1,601,728||
+||V     || 1,331,531||
+||SENT  ||   946,905||
+||VREL  ||   920,223||
+||PUNC  ||   741,439||
+||PREP  ||   729,404||
+||NUMCR ||   687,686||
+||ADJ   ||   647,608||
+||PRON  ||   391,243||
+||VN    ||   389,152||
+||AUX   ||   373,346||
+||NC    ||   322,592||
+||CONJ  ||   292,046||
+||PRONP ||   204,243||
+||ADV   ||   173,772||
+||NPC   ||   140,109||
+||ADJP  ||   126,138||
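To put the tag counts in proportion, each count can be divided by the total token count. The sketch below assumes that the tag table describes the same corpus as the token count of 20,287,250 given in the size table, which the page does not state explicitly; the function name `tag_shares` is hypothetical.

```python
# Token count taken from the corpus size table (assumed to be the same corpus).
TOKEN_COUNT = 20_287_250


def tag_shares(tag_counts, total=TOKEN_COUNT):
    """Return the fraction of all tokens carried by each part of speech tag."""
    return {tag: count / total for tag, count in tag_counts.items()}


# The four most frequent tags from the table above.
top_tags = {"N": 7_386_470, "NP": 2_660_200, "VP": 1_601_728, "V": 1_331_531}
shares = tag_shares(top_tags)

for tag, share in shares.items():
    print(f"{tag}: {share:.1%}")
```

Under that assumption, the N tag alone accounts for a little over a third of all tokens.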
 == References ==