Changes between Version 1 and Version 2 of SetOfEthiopianWebCorpora
- Timestamp: May 31, 2017, 6:51:46 PM
--- SetOfEthiopianWebCorpora (v1)
+++ SetOfEthiopianWebCorpora (v2)
@@ -3,21 +3,26 @@
  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=wic&reload=1 Amharic WIC corpus], 200 thousand tokens
 
-Amharic WIC corpus (News from Walta Information Center), manually tagged.
+Amharic WIC corpus (News from Walta Information Center), manually tagged. [[BR]]
+[https://nlp.fi.muni.cz/projects/habit/download/wic.vert.gz download corpus]
 
  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16&reload=1 amWaC16 corpus], 20 million tokens
 
-Amharic Web corpus. Crawled by !SpiderLing in August 2013, October 2015 and January 2016. Cleaned, de-duplicated. Tagged by !TreeTagger trained on Amharic WiC. [[BR]] [AmharicCorpus Corpus deliverable/technical report]
+Amharic Web corpus. Crawled by !SpiderLing in August 2013, October 2015 and January 2016. Cleaned, de-duplicated. Tagged by !TreeTagger trained on Amharic WiC. [[BR]] [AmharicCorpus Corpus deliverable/technical report][[BR]]
+[https://nlp.fi.muni.cz/projects/habit/download/am131516.vert.gz download corpus]
 
  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16 orWaC16 corpus], 5.1 million tokens.
 
-Oromo Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [OromoCorpus Corpus deliverable/technical report]
+Oromo Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [OromoCorpus Corpus deliverable/technical report][[BR]]
+[https://nlp.fi.muni.cz/projects/habit/download/or16.tag.vert.gz download corpus]
 
  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16 soWaC16 corpus], 80 million tokens.
 
-Somali Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [SomaliCorpus Corpus deliverable/technical report]
+Somali Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [SomaliCorpus Corpus deliverable/technical report][[BR]]
+[https://nlp.fi.muni.cz/projects/habit/download/so16.tag.vert.gz download corpus]
 
  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16 tiWaC16 corpus], 2.5 million tokens.
 
-Tigrinya Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [TigrinyaCorpus Corpus deliverable/technical report]
+Tigrinya Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated. [[BR]] [TigrinyaCorpus Corpus deliverable/technical report][[BR]]
+[https://nlp.fi.muni.cz/projects/habit/download/ti16.tag.vert.gz download corpus]
 
 === Software ===