Changes between Version 29 and Version 30 of InterimResults


Timestamp: Jan 17, 2017, 12:56:33 PM
Author: xsuchom2
Comment: Web corpora update

  • InterimResults

    v29 v30
    10  10   Amharic WIC corpus (News from Walta Information Center), manually tagged.
    11  11
    12       * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16&reload=1 Amharic WaC corpus], 17 million tokens
        12   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16&reload=1 Amharic WaC corpus], 20 million tokens
    13  13
    14        Amharic web corpus. Crawled by !SpiderLing  in August 2013 and October 2015. Encoded in UTF-8, cleaned, deduplicated. Automatically tagged by !TreeTagger  trained on Amharic WiC
        14    Amharic Web corpus. Crawled by !SpiderLing  in August 2013, October 2015 and January 2016. Encoded in UTF-8, cleaned, deduplicated. Automatically tagged by !TreeTagger  trained on Amharic WiC
    15  15
    16  16   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=or_spoken Oromo spoken corpus], 7,500 tokens.

    19  19   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16 Oromo WaC corpus], 5.1 million tokens.
    20  20
    21        Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
        21    Oromo Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
        22
    22  23   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16 Somali WaC corpus], 80 million tokens.
    23  24
    24        Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
        25    Somali Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
        26
    25  27   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16 Tigrinya WaC corpus], 2.5 million tokens.
    26  28
    27        Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
        29    Tigrinya Web corpus crawled by !SpiderLing in January 2016. Cleaned, de-duplicated.
    28  30
    29  31   * [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=czech_norwegian_opus__norwegian Czech-Norwegian parallel corpus], 4 million aligned segments.