Changes between Version 4 and Version 5 of HabitSystemFinal
Timestamp: Jun 2, 2017, 12:29:14 PM
HabitSystemFinal
--- v4
+++ v5
@@ -37,9 +37,9 @@
 * Corpus size: 20 million tokens.
 * Crawled by !SpiderLing in August 2013, October 2015 and January 2016. Boilerplate-cleaned, de-duplicated.
-* Tagged by !TreeTagger trained on Amharic WiC.
+* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi HaBiT Amharic Tagger module] ([http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ !TreeTagger] trained on [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=am_wic Amharic WiC corpus]) applying the [https://www.researchgate.net/profile/Girma_Demeke/publication/237553785_Manual_Annotation_of_Amharic_News_Items_with_Part-of-Speech_Tags_and_its_Challenges/links/57045f2308ae74a08e246382.pdf Amharic POS tagset].
 * [AmharicCorpus Corpus deliverable/technical report]

 ==== Examples of HaBiT System features for the Amharic Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=amwac16 Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=amwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=amwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግሥት" ("government")] – Words or phrases in a natural Amharic context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=amwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&minfreq=6&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግሥት" ("government")] – An essential feature for creating dictionaries in Amharic.
@@ -51,9 +51,9 @@
 * Corpus size: 5.1 million tokens.
 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
-* Tagged with the Universal POS tagset.
+* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/omtag/index.cgi HaBiT Oromo Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
 * [OromoCorpus Corpus deliverable/technical report]

 ==== Examples of HaBiT System features for the Oromo Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=orwac16 Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=orwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=orwac16&reload=&iquery=mootummaa&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "mootummaa" ("government") in context] – Words or phrases in a natural Oromo context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=orwac16&reload=&lemma=mootummaa&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "mootummaa" ("government")] – An essential feature for creating dictionaries in Oromo.
@@ -65,9 +65,9 @@
 * Corpus size: 80 million tokens.
 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
-* Tagged with the Universal POS tagset.
+* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/sotag/index.cgi HaBiT Somali Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
 * [SomaliCorpus Corpus deliverable/technical report]

 ==== Examples of HaBiT System features for the Somali Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=sowac16 Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=sowac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=sowac16&reload=&iquery=dowladda&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.tld=&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "dowladda" ("government") in context] – Words or phrases in a natural Somali context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=sowac16&reload=&lemma=dowladda&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "dowladda" ("government")] – An essential feature for creating dictionaries in Somali.
@@ -79,9 +79,9 @@
 * Corpus size: 2.5 million tokens.
 * Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
-* Tagged with the Universal POS tagset.
+* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/titag/index.cgi HaBiT Tigrinya Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
 * [TigrinyaCorpus Corpus deliverable/technical report]

 ==== Examples of HaBiT System features for the Tigrinya Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=tiwac16 Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=tiwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=tiwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&queryselector=iqueryrow&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግስቲ " ("government") in context] – Words or phrases in a natural Tigrinya context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=tiwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግስቲ " ("government")] – An essential feature for creating dictionaries in Tigrinya.
@@ -96,5 +96,5 @@

 ==== Examples of HaBiT System features for the Norwegian Bokmål Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_bokmal Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_bokmal Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_bokmal&reload=&iquery=regjering&queryselector=iqueryrow&lemma=&lpos=&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Bokmål context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=notenten15_4_bokmal&reload=&lemma=regjering&lpos=&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=4&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "regjering" ("government")] – An essential feature for creating dictionaries in Norwegian Bokmål.
@@ -109,5 +109,5 @@

 ==== Examples of HaBiT System features for the Norwegian Nynorsk Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_nynorsk Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_nynorsk Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_nynorsk&reload=&iquery=regjering&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Nynorsk context. The base function for language study and the source of good dictionary examples.

@@ -119,5 +119,5 @@

 ==== Examples of HaBiT System features for the Czech Web Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=cztenten16_0 Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=cztenten16_0 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/reduce?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D&q=Fdoc&corpname=cztenten16_0&viewmode=sen&attrs=word&ctxattrs=word&structs=p%2Cg&refs=doc&pagesize=40&gdexconf=&iquery=vl%C3%A1da&attr_tooltip=nott&rlines=250 Examples of the use of "vláda" ("government") in context] – Words or phrases in a natural Czech context. The base function for language study and the source of good dictionary examples.
 * [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=cztenten16_0&reload=&lemma=vl%C3%A1da&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "vláda" ("government")] – An essential feature for creating dictionaries in Czech.
@@ -132,4 +132,4 @@

 ==== Examples of HaBiT System features for the Czech-!Norwegian/Norwegian-Czech Parallel Corpus ====
-* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__czech Czech-Norwegian Parallel Corpus information], [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__norwegian Norwegian-Czech Parallel Corpus information]
+* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__czech Czech-Norwegian Parallel Corpus information], [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__norwegian Norwegian-Czech Parallel Corpus information] – Document count, word count, lexicon sizes, tagset description.
 * [http://corpora.fi.muni.cz/habit/run.cgi/view?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D+within+czech_norwegian_opus__norwegian%3A%5Blc%3D%22regjering%22%5D;corpname=czech_norwegian_opus__czech;viewmode=align;attrs=word&ctxattrs=word&structs=p%2Cg&refs=align&pagesize=40&align=czech_norwegian_opus__norwegian&gdexconf=&iquery=vl%C3%A1da&maincorp=czech_norwegian_opus__czech&attr_tooltip=nott;fromp=1 Examples of the use of Czech "vláda" ("government") with aligned segments of Norwegian "regjering" ("government") in context] – Words or phrases in a natural Czech and Norwegian context. The base function for language study and translation services.
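Most of the concordance links on this page follow one pattern: the `run.cgi/first` endpoint, a `corpname`, and the search word in `iquery`, percent-encoded as UTF-8 when it is written in Ethiopic script. The following Python sketch shows how such a link might be assembled. The function name `concordance_url` is hypothetical, and the reduced parameter set (only `corpname`, `iquery`, `queryselector` and `default_attr`) is an assumption: the links above carry additional empty parameters, and it has not been checked whether the server accepts a request without them.

```python
from urllib.parse import quote, urlencode

# Base address copied from the links on this page.
BASE = "http://corpora.fi.muni.cz/habit/run.cgi"


def concordance_url(corpname: str, query: str) -> str:
    """Build a simple concordance link for one word or phrase.

    Hypothetical helper: only a subset of the parameters seen in the
    page's links is emitted (an assumption, not a documented minimum).
    """
    params = {
        "corpname": corpname,
        "iquery": query,
        "queryselector": "iqueryrow",
        "default_attr": "word",
    }
    # quote_via=quote percent-encodes non-ASCII text as UTF-8 bytes
    # and encodes spaces as %20 rather than "+".
    return f"{BASE}/first?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(concordance_url("amwac16", "መንግሥት"))
    print(concordance_url("orwac16", "mootummaa"))
```

The encoded Amharic query produced this way matches the `iquery` value in the Amharic concordance link above, which is a quick check that the encoding step is right.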