Changes between Version 4 and Version 5 of HabitSystemFinal


Timestamp: Jun 2, 2017, 12:29:14 PM
Author: xsuchom2
Comment: Additional info

  • HabitSystemFinal

    v4 v5  
    3737 * Corpus size: 20 million tokens.
    3838 * Crawled by !SpiderLing in August 2013, October 2015 and January 2016. Boilerplate-cleaned, de-duplicated.
    39  * Tagged by !TreeTagger trained on Amharic WiC.
     39 * Tagged using the [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi HaBiT Amharic Tagger module] ([http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ !TreeTagger] trained on [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=am_wic Amharic WiC corpus]) applying the [https://www.researchgate.net/profile/Girma_Demeke/publication/237553785_Manual_Annotation_of_Amharic_News_Items_with_Part-of-Speech_Tags_and_its_Challenges/links/57045f2308ae74a08e246382.pdf Amharic POS tagset].
    4040 * [AmharicCorpus Corpus deliverable/technical report]
    4141
    4242==== Examples of HaBiT System features for the Amharic Web Corpus ====
    43  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=amwac16 Corpus information]
     43 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=amwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    4444 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=amwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግሥት" ("government")] – Words or phrases in a natural Amharic context. The base function for language study and the source of good dictionary examples.
    4545 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=amwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&minfreq=6&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግሥት" ("government")] – An essential feature for creating dictionaries in Amharic.
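The `iquery` and `lemma` parameters in the Amharic links above carry the search word in percent-encoded UTF-8. As a minimal sketch (the variable names are illustrative and not part of the HaBiT System), the encoded value can be decoded and re-encoded with Python's standard library:

```python
from urllib.parse import quote, unquote

# Value of the iquery parameter, copied from the concordance link above.
encoded_query = "%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5"

# unquote() interprets the percent-encoded bytes as UTF-8 by default.
decoded_query = unquote(encoded_query)
print(decoded_query)

# List the Unicode code points of the decoded word (Ethiopic block).
print([f"U+{ord(ch):04X}" for ch in decoded_query])

# Re-encoding the decoded word gives back the original parameter value.
reencoded_query = quote(decoded_query)
print(reencoded_query == encoded_query)
```

The same round trip applies to the Tigrinya links, whose parameter value differs only in the last two characters.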
     
    5151 * Corpus size: 5.1 million tokens.
    5252 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
    53  * Tagged with the Universal POS tagset.
     53 * Tagged using the [https://nlp.fi.muni.cz/projekty/habit/omtag/index.cgi HaBiT Oromo Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
    5454 * [OromoCorpus Corpus deliverable/technical report]
    5555
    5656==== Examples of HaBiT System features for the Oromo Web Corpus ====
    57  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=orwac16 Corpus information]
     57 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=orwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    5858 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=orwac16&reload=&iquery=mootummaa&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "mootummaa" ("government") in context] – Words or phrases in a natural Oromo context. The base function for language study and the source of good dictionary examples.
    5959 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=orwac16&reload=&lemma=mootummaa&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "mootummaa" ("government")] – An essential feature for creating dictionaries in Oromo.
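The concordance links in these lists share one shape: the `first` endpoint plus a `corpname` and an `iquery` parameter. The sketch below builds such a link for any corpus and word. The function name `concordance_url` is hypothetical, and it is an untested assumption that the server accepts this reduced parameter set; the links above also pass a number of empty parameters.

```python
from urllib.parse import urlencode

# Base address taken from the links above.
BASE = "http://corpora.fi.muni.cz/habit/run.cgi"


def concordance_url(corpname: str, word: str) -> str:
    """Build a simple-query concordance link (hypothetical helper)."""
    params = {
        "corpname": corpname,          # e.g. "orwac16"
        "iquery": word,                # the word or phrase to look up
        "queryselector": "iqueryrow",  # value used in the links above
        "default_attr": "word",        # value used in the links above
    }
    # urlencode() percent-encodes non-ASCII words as UTF-8.
    return f"{BASE}/first?{urlencode(params)}"


print(concordance_url("orwac16", "mootummaa"))
```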
     
    6565 * Corpus size: 80 million tokens.
    6666 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
    67  * Tagged with the Universal POS tagset.
     67 * Tagged using the [https://nlp.fi.muni.cz/projekty/habit/sotag/index.cgi HaBiT Somali Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
    6868 * [SomaliCorpus Corpus deliverable/technical report]
    6969
    7070==== Examples of HaBiT System features for the Somali Web Corpus ====
    71  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=sowac16 Corpus information]
     71 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=sowac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    7272 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=sowac16&reload=&iquery=dowladda&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.tld=&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "dowladda" ("government") in context] – Words or phrases in a natural Somali context. The base function for language study and the source of good dictionary examples.
    7373 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=sowac16&reload=&lemma=dowladda&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "dowladda" ("government")] – An essential feature for creating dictionaries in Somali.
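To see which settings a word sketch link carries, its query string can be split into parameters. This is a minimal sketch using only the standard library; it reports what is in the URL and makes no claim about how the server interprets each parameter.

```python
from urllib.parse import parse_qs, urlsplit

# Word sketch link for "dowladda", copied from the list above.
url = (
    "http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=sowac16&reload="
    "&lemma=dowladda&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s"
    "&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1"
    "&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma="
)

parts = urlsplit(url)
# parse_qs() drops parameters with empty values by default and
# collects repeated names (here: structured) into one list.
params = parse_qs(parts.query)

print(parts.path)
for name, values in params.items():
    print(name, values)
```

Parameters that appear with an empty value in the link, such as `reload` or `bim_lemma`, are absent from `params`; pass `keep_blank_values=True` to retain them.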
     
    7979 * Corpus size: 2.5 million tokens.
    8080 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
    81  * Tagged with the Universal POS tagset.
     81 * Tagged using the [https://nlp.fi.muni.cz/projekty/habit/titag/index.cgi HaBiT Tigrinya Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
    8282 * [TigrinyaCorpus Corpus deliverable/technical report]
    8383
    8484==== Examples of HaBiT System features for the Tigrinya Web Corpus ====
    85  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=tiwac16 Corpus information]
     85 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=tiwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    8686 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=tiwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&queryselector=iqueryrow&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግስቲ" ("government") in context] – Words or phrases in a natural Tigrinya context. The base function for language study and the source of good dictionary examples.
    8787 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=tiwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግስቲ" ("government")] – An essential feature for creating dictionaries in Tigrinya.
     
    9696
    9797==== Examples of HaBiT System features for the Norwegian Bokmål Web Corpus ====
    98  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_bokmal Corpus information]
     98 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_bokmal Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    9999 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_bokmal&reload=&iquery=regjering&queryselector=iqueryrow&lemma=&lpos=&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Bokmål context. The base function for language study and the source of good dictionary examples.
    100100 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=notenten15_4_bokmal&reload=&lemma=regjering&lpos=&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=4&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "regjering" ("government")] – An essential feature for creating dictionaries in Norwegian Bokmål.
     
    109109
    110110==== Examples of HaBiT System features for the Norwegian Nynorsk Web Corpus ====
    111  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_nynorsk Corpus information]
     111 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_nynorsk Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    112112 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_nynorsk&reload=&iquery=regjering&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Nynorsk context. The base function for language study and the source of good dictionary examples.
    113113
     
    119119
    120120==== Examples of HaBiT System features for the Czech Web Corpus ====
    121  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=cztenten16_0 Corpus information]
     121 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=cztenten16_0 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
    122122 * [http://corpora.fi.muni.cz/habit/run.cgi/reduce?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D&q=Fdoc&corpname=cztenten16_0&viewmode=sen&attrs=word&ctxattrs=word&structs=p%2Cg&refs=doc&pagesize=40&gdexconf=&iquery=vl%C3%A1da&attr_tooltip=nott&rlines=250 Examples of the use of "vláda" ("government") in context] – Words or phrases in a natural Czech context. The base function for language study and the source of good dictionary examples.
    123123 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=cztenten16_0&reload=&lemma=vl%C3%A1da&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "vláda" ("government")] – An essential feature for creating dictionaries in Czech.
     
    132132
    133133==== Examples of HaBiT System features for the Czech-!Norwegian/Norwegian-Czech Parallel Corpus ====
    134  * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__czech Czech-Norwegian Parallel Corpus information], [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__norwegian Norwegian-Czech Parallel Corpus information]
     134 * [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__czech Czech-Norwegian Parallel Corpus information], [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__norwegian Norwegian-Czech Parallel Corpus information] – Document count, word count, lexicon sizes, tagset description.
    135135 * [http://corpora.fi.muni.cz/habit/run.cgi/view?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D+within+czech_norwegian_opus__norwegian%3A%5Blc%3D%22regjering%22%5D;corpname=czech_norwegian_opus__czech;viewmode=align;attrs=word&ctxattrs=word&structs=p%2Cg&refs=align&pagesize=40&align=czech_norwegian_opus__norwegian&gdexconf=&iquery=vl%C3%A1da&maincorp=czech_norwegian_opus__czech&attr_tooltip=nott;fromp=1 Examples of the use of Czech "vláda" ("government") with aligned segments of Norwegian "regjering" ("government") in context] – Words or phrases in a natural Czech and Norwegian context. The base function for language study and translation services.
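The parallel-corpus link above passes its query in the first `q` parameter, percent-encoded and with `+` standing for spaces. The sketch below decodes that value so the underlying query can be read. Splitting on the first comma is an assumption about the layout of the value (an operation prefix followed by the query expression), inferred only from how this one link looks.

```python
from urllib.parse import unquote_plus

# Value of the first q parameter, copied from the parallel-corpus link above.
encoded_q = (
    "aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D"
    "+within+czech_norwegian_opus__norwegian%3A%5Blc%3D%22regjering%22%5D"
)

# unquote_plus() decodes %XX escapes as UTF-8 and turns "+" into spaces.
decoded_q = unquote_plus(encoded_q)
print(decoded_q)

# Assumed layout: "<prefix>,<query expression>".
prefix, expression = decoded_q.split(",", 1)
print(prefix)
print(expression)
```

Decoded, the expression matches Czech tokens whose lowercase form or lowercase lemma is "vláda", restricted by a `within` clause that names the aligned corpus `czech_norwegian_opus__norwegian` and the lowercase form "regjering".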