Changes between Version 2 and Version 3 of HabitSystemV3


Timestamp:
May 31, 2017, 3:26:28 PM
Author:
xsuchom2

  • HabitSystemV3

    v2 v3  
    1414 * [https://nlp.fi.muni.cz/projekty/habit/geez2sera/index.cgi geez2sera] is a tool for converting (transliterating) [https://en.wikipedia.org/wiki/Ge%27ez_script Geez] script into [http://abyssiniagateway.net/fidel/sera-faq_1.html SERA], System for Ethiopic Representation using Latin ASCII characters. It is based on [https://github.com/fgaim/HornMorpho L3 Python library] (GPL licence).
    1515 * [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi AmTag] is a tagger for Amharic. It is [http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ TreeTagger] with an Amharic model.
     16 * [http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/ Language identification using trigrams] (licensed under the [https://docs.python.org/3/license.html PSF licence])
    1617
    1718=== Language resources ===
    1819
    1920==== Wordlists, top 1000 items ====
    20 
    21   * [raw-attachment:Amharic_geez.wl.txt Amharic wordlist, Geez'], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=amwac16&viewmode=align&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=amwac16&reload=&wlattr=word&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
    22   * [raw-attachment:Amharic_sera.wl.txt Amharic wordlist, SERA], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=amwac16&viewmode=align&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=amwac16&reload=&wlattr=sera&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
    23   * [raw-attachment:Oromo.wl.txt Oromo wordlist], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=orwac16&viewmode=align&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=orwac16&reload=&wlattr=lc&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
    24   * [raw-attachment:Somali.wl.txt Somali wordlist], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=sowac16&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=sowac16&reload=&wlattr=word&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
    25   * [raw-attachment:Tigrinya_geez.wl.txt Tigrinya wordlist, Geez'], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=tiwac16&viewmode=align&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=tiwac16&reload=&usesubcorp=&wlattr=word&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
    26   * [raw-attachment:Tigrinya_sera.wl.txt Tigrinya wordlist, SERA], browse the wordlist [http://corpora.fi.muni.cz/habit/run.cgi/wordlist?corpname=tiwac16&viewmode=align&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=tiwac16&reload=&usesubcorp=&wlattr=sera&usengrams=0&ngrams_n=2&wlpat=&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=simple online]
     21  * [raw-attachment:Amharic_geez.wl.txt Amharic wordlist, Geez'], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=amwac16 browse the Amharic wordlist online]
     22  * [raw-attachment:Amharic_sera.wl.txt Amharic wordlist, SERA], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=amwac16 browse the Amharic wordlist online]
     23  * [raw-attachment:Czech.wl.txt Czech wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=cztenten16_0 browse the Czech wordlist online]
     24  * [raw-attachment:Norwegian_Bokmal.wl.txt Norwegian Bokmål wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=notenten15_4_bokmal browse the Norwegian Bokmål wordlist online]
     25  * [raw-attachment:Norwegian_Nynorsk.wl.txt Norwegian Nynorsk wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=notenten15_4_nynorsk browse the Norwegian Nynorsk wordlist online]
     26  * [raw-attachment:Oromo.wl.txt Oromo wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=orwac16 browse the Oromo wordlist online]
     27  * [raw-attachment:Somali.wl.txt Somali wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=sowac16 browse the Somali wordlist online]
     28  * [raw-attachment:Tigrinya_geez.wl.txt Tigrinya wordlist, Geez'], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=tiwac16 browse the Tigrinya wordlist online]
     29  * [raw-attachment:Tigrinya_sera.wl.txt Tigrinya wordlist, SERA], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=tiwac16 browse the Tigrinya wordlist online]
    2730
    2831=== Language dependent models ===
    29  * Language samples for language identification (possibly using a character trigram model):
     32 * Language samples for language identification using a character trigram model:
    3033  * [raw-attachment:Amharic_sample.txt]
    3134  * [raw-attachment:Arabic_sample.txt]
     35  * [raw-attachment:Czech_sample.txt]
     36  * [raw-attachment:Danish_sample.txt]
    3237  * [raw-attachment:English_sample.txt]
     38  * [raw-attachment:Norwegian_Bokmal_sample.txt]
     39  * [raw-attachment:Norwegian_Nynorsk_sample.txt]
    3340  * [raw-attachment:Oromo_sample.txt]
     41  * [raw-attachment:Slovak_sample.txt]
    3442  * [raw-attachment:Somali_sample.txt]
    3543  * [raw-attachment:Tigrinya_sample.txt]
    36  * Wordlists for language identification and boilerplate removal:
     44 * Wordlists for language identification and boilerplate removal using Justext:
    3745  * [raw-attachment:Amharic_justext.txt]
    3846  * [raw-attachment:Arabic_justext.txt]
     47  * [raw-attachment:Czech_justext.txt]
     48  * [raw-attachment:Danish_justext.txt]
    3949  * [raw-attachment:English_justext.txt]
     50  * [raw-attachment:Norwegian_Bokmal_justext.txt]
     51  * [raw-attachment:Norwegian_Nynorsk_justext.txt]
    4052  * [raw-attachment:Oromo_justext.txt]
     53  * [raw-attachment:Slovak_justext.txt]
    4154  * [raw-attachment:Somali_justext.txt]
    4255  * [raw-attachment:Tigrinya_justext.txt]
    4356 * Byte trigram models for character encoding detection using chared:
    4457  * [raw-attachment:Arabic.edm]
     58  * [raw-attachment:Czech.edm]
     59  * [raw-attachment:Danish.edm]
    4560  * [raw-attachment:English.edm]
     61  * [raw-attachment:Norwegian.edm]
     62  * [raw-attachment:Slovak.edm]
    4663  * [raw-attachment:Universal_UTF8.edm] -- used for Amharic, Oromo, Somali, Tigrinya
    4764 * Tokenisation rules for tokenisation using Unitok:
    48   * [raw-attachment:unitok_Arabic.py]
    49   * [raw-attachment:unitok_English.py]
     65  * [raw-attachment:unitok_Czech.py]
    5066  * [raw-attachment:unitok_Ethiopan.py] -- used for Amharic, Oromo, Somali, Tigrinya
     67  * [raw-attachment:unitok_Norwegian.py]
     68
     69Notes:
     70  * Slovak models were used for distinguishing Czech from Slovak
     71  * Danish models were used for distinguishing Norwegian from Danish
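
The notes above say that a sample of a closely related language (Slovak, Danish) is kept so the identifier can tell it apart from the target language (Czech, Norwegian). A minimal sketch of how character trigram identification can use such samples is given below. It is not the linked recipe's code: the function names, the cosine-similarity scoring and the two short in-memory sample strings are all assumptions made for illustration. A real run would presumably read the attached files (e.g. Czech_sample.txt, Slovak_sample.txt) instead.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return relative frequencies of character trigrams in `text`."""
    # Lowercase and collapse whitespace so the profile ignores layout.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse trigram profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(target, profiles[lang]))


# Hypothetical toy samples standing in for the attached *_sample.txt files.
SAMPLES = {
    "Czech": "Příliš žluťoučký kůň úpěl ďábelské ódy",
    "Slovak": "Kŕdeľ šťastných ďatľov učí pri ústí Váhu mĺkveho koňa obhrýzať kôru",
}

# One trigram profile per language, built once and reused for every query.
profiles = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("žluťoučký kůň", profiles))
```

With samples this short the profiles are far too sparse for real text. The sketch only shows why a Slovak profile is needed when collecting Czech: without it, Slovak input would have no better-matching competitor and would be labelled Czech.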