Changes between Version 28 and Version 29 of InterimResults


Timestamp: Jan 17, 2017, 12:52:15 PM
Author: xbaisa
Comment: --

  • InterimResults

    v28 v29
      6   6 The system includes selected corpus processing tools and the following HaBiT corpora:
      7   7
      8      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=wic&reload=1 Amharic WIC corpus], 200 thousand tokens
          8  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=wic&reload=1 Amharic WIC corpus], 200 thousand tokens
      9   9
     10  10   Amharic WIC corpus (News from Walta Information Center), manually tagged.
     11  11
     12      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=amwac16&reload=1 Amharic WaC corpus], 17 million tokens
         12  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16&reload=1 Amharic WaC corpus], 17 million tokens
     13  13
     14  14   Amharic web corpus. Crawled by !SpiderLing  in August 2013 and October 2015. Encoded in UTF-8, cleaned, deduplicated. Automatically tagged by !TreeTagger  trained on Amharic WiC
     15  15
     16      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=or_spoken Oromo spoken corpus], 7,500 tokens.
         16  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=or_spoken Oromo spoken corpus], 7,500 tokens.
     17  17
     18  18   Oromo spoken corpus containing 1205 utterances. Built by Text Laboratory, University of Oslo.
     19      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=orwac16 Oromo WaC corpus], 5.1 million tokens.
         19  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16 Oromo WaC corpus], 5.1 million tokens.
     20  20
     21  21   Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
     22      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=sowac16 Somali WaC corpus], 80 million tokens.
         22  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16 Somali WaC corpus], 80 million tokens.
     23  23
     24  24   Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
     25      * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=tiwac16 Tigrinya WaC corpus], 2.5 million tokens.
         25  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16 Tigrinya WaC corpus], 2.5 million tokens.
     26  26
     27  27   Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
         28
         29  * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=czech_norwegian_opus__norwegian Czech-Norwegian parallel corpus], 4 million aligned segments.
         30
         31   Czech-Norwegian parallel corpus from subtitles, OpenSubtitles2016 subcorpus of OPUS2, filtered for Czech and Norwegian.
     28  32
     29  33 == Publications ==