Changes between Version 28 and Version 29 of InterimResults
- Timestamp: Jan 17, 2017, 12:52:15 PM (7 years ago)
InterimResults
 v28  v29
   6    6    The system includes selected corpus processing tools and the following HaBiT corpora:
   7    7
   8       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic&reload=1 Amharic WIC corpus], 200 thousand tokens
        8  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=wic&reload=1 Amharic WIC corpus], 200 thousand tokens
   9    9
  10   10    Amharic WIC corpus (News from Walta Information Center), manually tagged.
  11   11
  12       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=amwac16&reload=1 Amharic WaC corpus], 17 million tokens
       12  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16&reload=1 Amharic WaC corpus], 17 million tokens
  13   13
  14   14    Amharic web corpus. Crawled by !SpiderLing in August 2013 and October 2015. Encoded in UTF-8, cleaned, deduplicated. Automatically tagged by !TreeTagger trained on Amharic WiC
  15   15
  16       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=or_spoken Oromo spoken corpus], 7,500 tokens.
       16  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=or_spoken Oromo spoken corpus], 7,500 tokens.
  17   17
  18   18    Oromo spoken corpus containing 1205 utterances. Built by Text Laboratory, University of Oslo.
  19       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=orwac16 Oromo WaC corpus], 5.1 million tokens.
       19  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16 Oromo WaC corpus], 5.1 million tokens.
  20   20
  21   21    Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
  22       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=sowac16 Somali WaC corpus], 80 million tokens.
       22  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16 Somali WaC corpus], 80 million tokens.
  23   23
  24   24    Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
  25       - * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=tiwac16 Tigrinya WaC corpus], 2.5 million tokens.
       25  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16 Tigrinya WaC corpus], 2.5 million tokens.
  26   26
  27   27    Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
       28  +
       29  + * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=czech_norwegian_opus__norwegian Czech-Norwegian parallel corpus], 4 million aligned segments.
       30  +
       31  + Czech-Norwegian parallel corpus from subtitles, OpenSubtitles2016 subcorpus of OPUS2, filtered for Czech and Norwegian.
  28   32
  29   33    == Publications ==
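The link change recorded in this revision (each corpus URL moving from the `first` endpoint to `first_form`) could be applied to other stored links with a small rewrite step. The following is only a minimal sketch: the helper name `to_first_form` is hypothetical, and it assumes the v28 links had the shape `.../run.cgi/first?corpname=...` with the query string left unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def to_first_form(url: str) -> str:
    """Rewrite a .../run.cgi/first?... link to .../run.cgi/first_form?...

    Hypothetical helper: links that do not end in /run.cgi/first
    (including ones already pointing at first_form) are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path.endswith("/run.cgi/first"):
        # Only the path changes; scheme, host and query are preserved.
        parts = parts._replace(path=parts.path + "_form")
    return urlunsplit(parts)


old_link = "http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic&reload=1"
new_link = to_first_form(old_link)
print(new_link)
```

Because the check looks at the end of the path, running the helper on a link that was already updated leaves it as it is, so it can be applied to a mixed list of old and new links.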