Changes between Version 3 and Version 4 of AmharicCorpus
- Timestamp: Jan 16, 2017, 7:02:13 PM
AmharicCorpus
= D3.1a: An Amharic corpus, sized 20 million words =

== Building the Amharic Web corpus ==
We used the following steps to create a large Web corpus. First, following the Corpus factory method [1], we queried the Bing search engine for documents in Amharic using bigrams of Amharic words from the Crúbadán database [2]. The 354 queries yielded 6,453 URLs. The URLs of the 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.

…

The size of corpus structures:
||=Document count =|| 33,542||
||=Paragraph count =|| 341,327||
||=Sentence count =|| 1,208,926||
||=Token count =|| 20,287,250||
||=Ge'ez lexicon size=|| 955,628||
||=Sera lexicon size =|| 948,553||

Document count – the most frequent web domains and domain size distribution:
||||= Top level domains =||||= Web domains =||||= Domain size distribution =||
||org || 14,582|| *.jw.org || 6,717|| At least 1000 documents || 7||
||com || 11,927|| *.gov.et || 4,599|| At least 500 documents || 15||
||et || 5,090|| waltainfo.com || 2,818|| At least 100 documents || 42||
||net || 1,084|| ginbot7.org || 2,666|| At least 50 documents || 63||
||cz || 724|| eotcmk.org || 1,141|| At least 10 documents || 149||
||info || 85|| ethsat.com || 894|| At least 1 document || 573||

We observe that content from news, political, and religious portals has a significant presence in the corpus sources. Since only 149 domains are represented in the corpus by at least 10 documents, the resulting collection would benefit from a greater variety of sources.

…

The most frequent parts of speech in both corpora are nouns and verbs. The most frequent part of speech tags:
||=Part of speech tag =||=Token count =||
||N || 7,386,470||
||NP || 2,660,200||
||VP || 1,601,728||
||V || 1,331,531||
||SENT || 946,905||
||VREL || 920,223||
||PUNC || 741,439||
||PREP || 729,404||
||NUMCR || 687,686||
||ADJ || 647,608||
||PRON || 391,243||
||VN || 389,152||
||AUX || 373,346||
||NC || 322,592||
||CONJ || 292,046||
||PRONP || 204,243||
||ADV || 173,772||
||NPC || 140,109||
||ADJP || 126,138||

== References ==