= D1.3 The final HaBiT System: Tested and evaluated system demonstrator =

The HaBiT System is accessible at [http://corpora.fi.muni.cz/habit/]. The following corpus building tools were used to build eight corpora in Amharic, Oromo, Somali, Tigrinya, Norwegian and Czech:

== Corpus building tools ==

=== Software ===

The following software was developed or adapted for the HaBiT System:

* WebBootCaT for HaBiT, a search engine querying and document downloading tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine and downloads the documents. This tool is an implementation of ''Baroni, Kilgarriff, Pomikálek, Rychlý. "WebBootCaT: instant domain-specific corpora to support human translators." In Proceedings of EAMT, pp. 247-252. 2006.''
* !SpiderLing, a web spider for linguistics, is software for obtaining text from the web that is useful for building text corpora. Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not composed of full sentences. In fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient.
* Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
* Justext is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
  It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as Web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate content from boilerplate.
* Onion is a tool for removing duplicate parts from large collections of texts. One instance of each duplicate part is kept; the others are marked or removed. The tool can remove both identical and similar texts at any level (document, paragraph or sentence, if present in the data). The de-duplication algorithm is based on comparing hashes of n-grams of words.
* [http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/ Language identification using trigrams] (licensed under the [https://docs.python.org/3/license.html PSF licence])
* Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format), while preserving XML-like tags containing metadata.
* [https://nlp.fi.muni.cz/projekty/habit/geez2sera/index.cgi geez2sera] is a tool for converting (transliterating) [https://en.wikipedia.org/wiki/Ge%27ez_script Geez] script into [http://abyssiniagateway.net/fidel/sera-faq_1.html SERA], the System for Ethiopic Representation using Latin ASCII characters. It is based on the [https://github.com/fgaim/HornMorpho L3 Python library] (GPL licence).
* [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi AmTag] is a tagger for Amharic. It is [http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ TreeTagger] with an Amharic model.
* [http://corpora.fi.muni.cz/habit/ Part of speech taggers for Oromo, Somali and Tigrinya] with respective models applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
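The de-duplication idea described for Onion above (comparing hashes of word n-grams) can be illustrated with a minimal sketch. This is not Onion's code: the function names, the n-gram length `n=5` and the `threshold=0.5` are assumptions chosen for illustration, and the sketch works on paragraphs held in memory rather than on a large corpus.

```python
import hashlib


def word_ngrams(text, n):
    """Return all word n-grams of the text (one short n-gram if the text has fewer than n words)."""
    words = text.split()
    if not words:
        return []
    if len(words) < n:
        return [tuple(words)]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_hash(ngram):
    """Hash an n-gram to 8 bytes so that only hashes, not the text, have to be remembered."""
    return hashlib.blake2b(" ".join(ngram).encode("utf-8"), digest_size=8).digest()


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams were already seen.

    Hypothetical parameters: n and threshold are illustrative, not Onion defaults.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        hashes = {ngram_hash(gram) for gram in word_ngrams(paragraph, n)}
        if not hashes:
            continue  # skip empty paragraphs
        duplicate_ratio = len(hashes & seen) / len(hashes)
        if duplicate_ratio <= threshold:
            kept.append(paragraph)
        seen |= hashes  # remember the n-grams of kept and dropped paragraphs alike
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat today",
    "a completely different sentence about web corpora here",
]
print(deduplicate(paragraphs))
```

In this example the exact repeat and the near-duplicate (four of its six 5-grams were already seen) are dropped, while the first and last paragraphs are kept.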
=== Language dependent models and configuration files ===

The following language dependent models and configuration files were produced for each language within the project:

* Language sample for language identification using a character trigram model.
* Wordlist for language identification and boilerplate removal using Justext.
* Byte trigram model for character encoding detection using Chared.
* Tokenisation rules for tokenisation using Unitok.
* !SpiderLing crawler configuration.

Notes:

* Separate Bokmål and Nynorsk models were created to distinguish these written standards of Norwegian.
* Arabic, English, Slovak and Danish models were created for removing texts not in target languages from corpus source documents.
* See [https://habit-project.eu/wiki/HabitSystemV3 D1.2.3 The HaBiT system v3] for files with all mentioned language dependent models and configuration files.

== Corpora ==

All corpora in the HaBiT System are publicly accessible at [http://corpora.fi.muni.cz/habit/]. Examples of HaBiT System features available for corpora in the project follow.

=== Amharic Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=amwac16 Browse the Amharic Web Corpus]

* Corpus size: 20 million tokens.
* Crawled by !SpiderLing in August 2013, October 2015 and January 2016. Boilerplate-cleaned, de-duplicated.
* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi HaBiT Amharic Tagger module] ([http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ !TreeTagger] trained on [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=am_wic Amharic WiC corpus]) applying the [https://www.researchgate.net/profile/Girma_Demeke/publication/237553785_Manual_Annotation_of_Amharic_News_Items_with_Part-of-Speech_Tags_and_its_Challenges/links/57045f2308ae74a08e246382.pdf Amharic POS tagset].
* [AmharicCorpus Corpus deliverable/technical report]

==== Examples of HaBiT System features for the Amharic Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=amwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/amharic_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=amwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግሥት" ("government")] – Words or phrases in a natural Amharic context. The base function for language study and the source of good dictionary examples. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/amharic_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=amwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&minfreq=6&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግሥት" ("government")] – An essential feature for creating dictionaries in Amharic. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/amharic_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=amwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%A5%E1%89%B5&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "መንግሥት" ("government")] – A useful list for creating an Amharic thesaurus.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/amharic_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/freqml?q=aword%2C%5Bword%3D%22.%7B3%2C%7D%22+%26+tag%3D%22N.*%22%5D&corpname=amwac16&viewmode=sen&refs=%3Ddoc.t2ld&ml=1&flimit=0&ml1attr=word&ml1ctx=0~0%3E0&freqlevel=2&ml2attr=sera&ml2ctx=0~0%3E0&ml3attr=word&ml3ctx=0~0%3E0&ml4attr=word&ml4ctx=0~0%3E0 List of Amharic nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/amharic_frq.png)]]

=== Oromo Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=orwac16 Browse the Oromo Web Corpus]

* Corpus size: 5.1 million tokens.
* Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/omtag/index.cgi HaBiT Oromo Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
* [OromoCorpus Corpus deliverable/technical report]

==== Examples of HaBiT System features for the Oromo Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=orwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/oromo_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=orwac16&reload=&iquery=mootummaa&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "mootummaa" ("government") in context] – Words or phrases in a natural Oromo context. The base function for language study and the source of good dictionary examples.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/oromo_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=orwac16&reload=&lemma=mootummaa&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "mootummaa" ("government")] – An essential feature for creating dictionaries in Oromo. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/oromo_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=orwac16&reload=&lemma=mootummaa&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "mootummaa" ("government")] – A useful list for creating an Oromo thesaurus. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/oromo_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/struct_wordlist?corpname=orwac16&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=orwac16&reload=&wlattr=tag&usengrams=0&ngrams_n=2&wlpat=NOUN&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=multilevel&wlstruct_attr1=word&wlstruct_attr2=&wlstruct_attr3= List of Oromo nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/oromo_frq.png)]]

=== Somali Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=sowac16 Browse the Somali Web Corpus]

* Corpus size: 80 million tokens.
* Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/sotag/index.cgi HaBiT Somali Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
* [SomaliCorpus Corpus deliverable/technical report]

==== Examples of HaBiT System features for the Somali Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=sowac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/somali_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=sowac16&reload=&iquery=dowladda&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.tld=&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "dowladda" ("government") in context] – Words or phrases in a natural Somali context. The base function for language study and the source of good dictionary examples. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/somali_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=sowac16&reload=&lemma=dowladda&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "dowladda" ("government")] – An essential feature for creating dictionaries in Somali. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/somali_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=sowac16&reload=&lemma=dowladda&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "dowladda" ("government")] – A useful list for creating a Somali thesaurus.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/somali_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/struct_wordlist?corpname=sowac16&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=sowac16&reload=&wlattr=tag&usengrams=0&ngrams_n=2&wlpat=NOUN&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=multilevel&wlstruct_attr1=word&wlstruct_attr2=&wlstruct_attr3= List of Somali nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/somali_frq.png)]]

=== Tigrinya Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=tiwac16 Browse the Tigrinya Web Corpus]

* Corpus size: 2.5 million tokens.
* Crawled by !SpiderLing in January 2016. Boilerplate-cleaned, de-duplicated.
* Tagged using the [https://nlp.fi.muni.cz/projekty/habit/titag/index.cgi HaBiT Tigrinya Tagger module] applying the [http://universaldependencies.org/u/pos/ Universal POS tagset].
* [TigrinyaCorpus Corpus deliverable/technical report]

==== Examples of HaBiT System features for the Tigrinya Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=tiwac16 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/tigrinya_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=tiwac16&reload=&iquery=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&queryselector=iqueryrow&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "መንግስቲ" ("government") in context] – Words or phrases in a natural Tigrinya context. The base function for language study and the source of good dictionary examples.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/tigrinya_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=tiwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&minfreq=auto&minscore=0.0&maxitems=20&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "መንግስቲ" ("government")] – An essential feature for creating dictionaries in Tigrinya. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/tigrinya_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=tiwac16&reload=&lemma=%E1%88%98%E1%8A%95%E1%8C%8D%E1%88%B5%E1%89%B2&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "መንግስቲ" ("government")] – A useful list for creating a Tigrinya thesaurus. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/tigrinya_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/struct_wordlist?corpname=tiwac16&refs=%3Ddoc.t2ld&wlmaxitems=100&wlsort=f&subcnorm=freq&corpname=tiwac16&reload=&wlattr=tag&usengrams=0&ngrams_n=2&wlpat=NOUN&wlminfreq=5&wlmaxfreq=0&wlfile=&wlblacklist=&wlnums=frq&wltype=multilevel&wlstruct_attr1=word&wlstruct_attr2=&wlstruct_attr3= List of Tigrinya nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/tigrinya_frq.png)]]

=== Norwegian Bokmål Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=notenten15_4_bokmal Browse the Norwegian Bokmål Web Corpus]

* Corpus size: 1.4 billion tokens.
* Crawled by !SpiderLing in February 2015. Boilerplate-cleaned, de-duplicated. Separated from Norwegian Nynorsk.
* Tagged by [https://www.sketchengine.co.uk/norwegian-oslo-bergen-part-of-speech-tagset/ Oslo-Bergen Tagger].
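
Separating Bokmål from Nynorsk relies on a separate language model for each written standard. The following is a minimal sketch of character-trigram language identification, not the code used in the project: it compares trigram frequency profiles by cosine similarity, which is a simplification of the trigram recipe referenced in the software list, and the two one-sentence training samples and all function names are assumptions made for illustration. Real models would be built from much larger language samples.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams of the lowercased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))


def cosine(a, b):
    """Cosine similarity of two trigram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Toy language samples (assumption: one sentence each, for illustration only).
SAMPLES = {
    "bokmal": "jeg vet ikke hva det heter",
    "nynorsk": "eg veit ikkje kva det heiter",
}
MODELS = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(MODELS, key=lambda lang: cosine(profile, MODELS[lang]))


print(identify("eg veit ikkje"))
print(identify("jeg vet ikke"))
```

With these toy samples the first phrase is closest to the Nynorsk profile and the second to the Bokmål profile. A document whose best score is low for every target language could be discarded, which is the role the additional non-target language models play in the pipeline.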
==== Examples of HaBiT System features for the Norwegian Bokmål Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_bokmal Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_bm_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_bokmal&reload=&iquery=regjering&queryselector=iqueryrow&lemma=&lpos=&phrase=&word=&wpos=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fc_pos_window_type=both&fc_pos_wsize=5&fc_pos_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Bokmål context. The base function for language study and the source of good dictionary examples. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_bm_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=notenten15_4_bokmal&reload=&lemma=regjering&lpos=&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=4&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "regjering" ("government")] – An essential feature for creating dictionaries in Norwegian Bokmål. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_bm_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=notenten15_4_bokmal&reload=&lemma=regjering&lpos=-n&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "regjering" ("government")] – A useful list for creating a Norwegian Bokmål thesaurus.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_bm_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/freqml?q=aword%2C%5Btag%3D%22subst%22%5D&corpname=notenten15_4_bokmal&viewmode=sen&attrs=word&ctxattrs=word&structs=p%2Cg&refs=%3Ddoc.urldomain&pagesize=40&gdexconf=&attr_tooltip=nott&ml=1&flimit=0&freqlevel=1&ml1attr=lemma&ml1ctx=0~0%3E0&ml2attr=word&ml2ctx=0~0%3E0&ml3attr=word&ml3ctx=0~0%3E0&ml4attr=word&ml4ctx=0~0%3E0 List of Norwegian Bokmål nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_bm_frq.png)]]

=== Norwegian Nynorsk Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=notenten15_4_nynorsk Browse the Norwegian Nynorsk Web Corpus]

* Corpus size: 1.4 billion tokens.
* Crawled by !SpiderLing in February 2015. Boilerplate-cleaned, de-duplicated. Separated from Norwegian Bokmål.
* Tagged by [https://www.sketchengine.co.uk/norwegian-oslo-bergen-part-of-speech-tagset/ Oslo-Bergen Tagger].

==== Examples of HaBiT System features for the Norwegian Nynorsk Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=notenten15_4_nynorsk Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_nn_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=notenten15_4_nynorsk&reload=&iquery=regjering&queryselector=iqueryrow&phrase=&word=&char=&cql=&default_attr=word&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all&fsca_doc.t2ld=&fsca_doc.urldomain= Examples of the use of "regjering" ("government") in context] – Words or phrases in a natural Norwegian Nynorsk context. The base function for language study and the source of good dictionary examples.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/norwegian_nn_conc.png)]]

=== Czech Web Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=cztenten16_0 Browse the Czech Web Corpus]

* Corpus size: 9.3 billion tokens.
* Crawled by !SpiderLing in November and December 2015 and from October to December 2016. Boilerplate-cleaned, de-duplicated.
* Tagged by [https://www.sketchengine.co.uk/tagset-reference-for-czech Czech POS tagger Majka].

==== Examples of HaBiT System features for the Czech Web Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=cztenten16_0 Corpus information] – Document count, sentence count, word count, lexicon sizes, tagset description. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/czech_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/reduce?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D&q=Fdoc&corpname=cztenten16_0&viewmode=sen&attrs=word&ctxattrs=word&structs=p%2Cg&refs=doc&pagesize=40&gdexconf=&iquery=vl%C3%A1da&attr_tooltip=nott&rlines=250 Examples of the use of "vláda" ("government") in context] – Words or phrases in a natural Czech context. The base function for language study and the source of good dictionary examples. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/czech_conc.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/wsketch?corpname=cztenten16_0&reload=&lemma=vl%C3%A1da&minfreq=auto&minscore=0.0&maxitems=25&sort_ws_columns=s&show_lemma_coverage=0&clustercolls=0&minsim=0.15&structured=0&structured=1&min_unary_score=5.0&min_mwlink_freq=100&nr_ws_cols=5&bim_corpname=&bim_lemma= Grammatical and collocational behaviour of "vláda" ("government")] – An essential feature for creating dictionaries in Czech.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/czech_ws.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/thes?corpname=cztenten16_0&reload=&lemma=vl%C3%A1da&maxthesitems=60&minthesscore=0.0&includeheadword=0&clusteritems=0&minsim=0.15 Words used in the same context as "vláda" ("government")] – A useful list for creating a Czech thesaurus. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/czech_thes.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/freqml?q=aword%2C%5Btag%3D%22k1.*%22%5D&corpname=cztenten16_0&viewmode=sen&attrs=word&ctxattrs=word&structs=p%2Cg&refs=doc&pagesize=40&gdexconf=&attr_tooltip=nott&ml=1&flimit=0&freqlevel=1&ml1attr=lemma&ml1ctx=0~0%3E0&ml2attr=word&ml2ctx=0~0%3E0&ml3attr=word&ml3ctx=0~0%3E0&ml4attr=word&ml4ctx=0~0%3E0 List of Czech nouns by frequency] – Useful for dictionary-based applications, e.g. predictive text writing. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/czech_frq.png)]]

=== Czech-!Norwegian/Norwegian-Czech Parallel Corpus ===

[http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=czech_norwegian_opus__czech Browse the Czech-Norwegian Parallel Corpus], or [http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=czech_norwegian_opus__norwegian the Norwegian-Czech Parallel Corpus]

* Corpus size: 32 million tokens.
* Created from subtitles: the OpenSubtitles2016 subcorpus of OPUS2, filtered for Czech and Norwegian.
* [ParallelCzechNorwegian Corpus deliverable/technical report]

==== Examples of HaBiT System features for the Czech-!Norwegian/Norwegian-Czech Parallel Corpus ====

* [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__czech Czech-Norwegian Parallel Corpus information], [http://corpora.fi.muni.cz/habit/run.cgi/corp_info?corpname=czech_norwegian_opus__norwegian Norwegian-Czech Parallel Corpus information] – Document count, word count, lexicon sizes, tagset description.
[[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/par_czech_norwegian_info.png)]] [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/par_norwegian_czech_info.png)]]
* [http://corpora.fi.muni.cz/habit/run.cgi/view?q=aword%2C%5Blc%3D%22vl%C3%A1da%22%7Clemma_lc%3D%22vl%C3%A1da%22%5D+within+czech_norwegian_opus__norwegian%3A%5Blc%3D%22regjering%22%5D;corpname=czech_norwegian_opus__czech;viewmode=align;attrs=word&ctxattrs=word&structs=p%2Cg&refs=align&pagesize=40&align=czech_norwegian_opus__norwegian&gdexconf=&iquery=vl%C3%A1da&maincorp=czech_norwegian_opus__czech&attr_tooltip=nott;fromp=1 Examples of the use of Czech "vláda" ("government") with aligned segments of Norwegian "regjering" ("government") in context] – Words or phrases in a natural Czech and Norwegian context. The base function for language study and translation services. [[BR]] [[Image(https://nlp.fi.muni.cz/projects/habit/screenshots/par_czech_norwegian_conc.png)]]
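
The feature links on this page are ordinary query URLs, so similar links can be generated for other words. The sketch below builds concordance, word sketch and thesaurus URLs with Python's standard library. The method names (`first`, `wsketch`, `thes`) and parameter names are copied from the links above; the helper function names are hypothetical, and it is an assumption that the server accepts this reduced set of parameters and fills in defaults for the rest.

```python
from urllib.parse import urlencode

BASE = "http://corpora.fi.muni.cz/habit/run.cgi"


def feature_url(method, params):
    """Join the base address, a method name and percent-encoded query parameters."""
    return f"{BASE}/{method}?{urlencode(params)}"


def concordance_url(corpname, query):
    """Concordance (examples in context) for a simple query."""
    return feature_url("first", {
        "corpname": corpname,
        "iquery": query,
        "queryselector": "iqueryrow",
        "default_attr": "word",
    })


def word_sketch_url(corpname, lemma, maxitems=20):
    """Grammatical and collocational behaviour of a lemma."""
    return feature_url("wsketch", {
        "corpname": corpname,
        "lemma": lemma,
        "minfreq": "auto",
        "maxitems": maxitems,
    })


def thesaurus_url(corpname, lemma, maxthesitems=60):
    """Words used in the same contexts as the lemma."""
    return feature_url("thes", {
        "corpname": corpname,
        "lemma": lemma,
        "maxthesitems": maxthesitems,
    })


print(concordance_url("orwac16", "mootummaa"))
print(word_sketch_url("cztenten16_0", "vláda"))
print(thesaurus_url("sowac16", "dowladda"))
```

Non-ASCII query words are percent-encoded as UTF-8 by `urlencode`, which matches the encoding visible in the Amharic, Tigrinya and Czech links above.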