Changes between Version 6 and Version 7 of AmharicCorpus


Timestamp:
Jan 17, 2017, 11:07:46 AM (7 years ago)
Author:
hales
Comment:

--

  • AmharicCorpus

    v6  v7
    2   2
    3   3   == Building the Amharic Web corpus ==
    4       We have used the following steps to create a big Web corpus: First, adopting the Corpus factory method [1] bigrams of Amharic words from the Crúbadán database [2] were used to query Bing search engine for documents in Amharic. 354 queries yielded 6,453 URLs. URLs of 3,145 successfully downloaded documents were used as starting points for web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
        4   We have used the following steps to create a big Web corpus: First, adopting the Corpus factory method [1] bigrams of Amharic words from the Crúbadán database [2] were used to query Bing search engine for documents in Amharic. 354 queries yielded 6,453 URLs. URLs of 3,145 successfully downloaded documents were used as starting points for web crawler !SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
    5   5
    6   6   The following language models were created:

    11  11  The crawler was set to harvest web domains in the Ethiopian national top level domain et and other general TLDs: com, org, info, net, edu. 3.6 GB of http responses was gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data.
    12  12  42 % of paragraphs were identified as duplicate or near duplicate and removed using tool onion [4]. 66 MB of deduplicated text obtained by the same process in 2013 was added to the data. Sentence boundaries were marked at positions with Amharic end of sentence characters "።" and "፧". The final size of the corpus (containing data from years 2013, 2015 and 2016) is 461 MB of 20.3 million tokens.
    13      Finally, the corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC16 (Amharic `Web as Corpus' corpus, year 2016).
        13  Finally, the corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC16 (Amharic `Web as Corpus' corpus, year 2016).
    14  14
    15  15  == Corpus properties ==
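
The sentence-boundary step mentioned in line 12 of the page text (marking boundaries at the Amharic end-of-sentence characters "።" and "፧") could be sketched roughly as follows. This is only an illustration: it assumes plain-text input, the function name `split_sentences` is hypothetical, and the page does not say which tool was actually used for amWaC16.

```python
import re

# The two Amharic end-of-sentence characters named in the page text.
SENTENCE_END_CHARS = "።፧"

# A run of non-terminator characters, optionally followed by one terminator.
_SENTENCE = re.compile(r"[^{0}]+[{0}]?".format(SENTENCE_END_CHARS))

def split_sentences(text):
    """Split text into sentences, keeping the terminator with each sentence.

    Hypothetical helper for illustration; not the tooling used for amWaC16.
    """
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group(0).strip()
        if sentence:  # skip whitespace-only fragments
            sentences.append(sentence)
    return sentences
```

Calling `split_sentences` on a paragraph returns a list of sentences, each keeping its closing "።" or "፧"; trailing text without a terminator is returned as a final item.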