Changes between Version 6 and Version 7 of AmharicCorpus

Timestamp: Jan 17, 2017, 11:07:46 AM
Author: hales
Comment: --


AmharicCorpus

-              v6
+              v7
 == Building the Amharic Web corpus ==
 We used the following steps to create a big Web corpus: First, adopting the Corpus factory method [1], bigrams of Amharic words from the Crúbadán database [2] were used to query the Bing search engine for documents in Amharic. 354 queries yielded 6,453 URLs. URLs of 3,145 successfully downloaded documents were used as starting points for the web crawler SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
+We used the following steps to create a big Web corpus: First, adopting the Corpus factory method [1], bigrams of Amharic words from the Crúbadán database [2] were used to query the Bing search engine for documents in Amharic. 354 queries yielded 6,453 URLs. URLs of 3,145 successfully downloaded documents were used as starting points for the web crawler !SpiderLing [3]. URLs of documents crawled in 2013 using a similar approach were added to the set of starting points.
 The following language models were created:
 …
 The crawler was set to harvest web domains in the Ethiopian national top-level domain et and in other general TLDs: com, org, info, net, edu. 3.6 GB of HTTP responses were gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data.
 % of paragraphs were identified as duplicate or near duplicate and removed using the onion tool [4]. 66 MB of deduplicated text obtained by the same process in 2013 was added to the data. Sentence boundaries were marked at positions with the Amharic end-of-sentence characters "።" and "፧". The final size of the corpus (containing data from the years 2013, 2015 and 2016) is 461 MB, or 20.3 million tokens.
 Finally, the corpus was tagged by TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC16 (Amharic `Web as Corpus' corpus, year 2016).
+Finally, the corpus was tagged by !TreeTagger with a model trained on the cleaned version of the WIC Corpus. The corpus is called amWaC16 (Amharic `Web as Corpus' corpus, year 2016).
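The sentence boundary marking described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used for amWaC16. The function name `split_sentences` is hypothetical, and the only detail taken from the text is the pair of end-of-sentence characters "።" and "፧".

```python
import re

# End-of-sentence characters named in the text:
# "።" (U+1362, Ethiopic full stop) and "፧" (U+1367, Ethiopic question mark).
# The pattern matches a run of non-terminator characters followed by an
# optional terminator, so a trailing unterminated fragment is kept as well.
SENTENCE = re.compile(r"[^።፧]+[።፧]?")

def split_sentences(text):
    """Split text into sentences at Amharic end-of-sentence characters.

    Hypothetical helper for illustration only.
    """
    pieces = (m.group(0).strip() for m in SENTENCE.finditer(text))
    # Drop pieces that were whitespace only.
    return [p for p in pieces if p]

if __name__ == "__main__":
    for sentence in split_sentences("ሰላም። እንዴት ነህ፧"):
        print(sentence)
```

Each returned string keeps its terminating character, so the original text can be reassembled by joining the sentences with spaces.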
 == Corpus properties ==