= D1.2.3 The HaBiT system v3: Third and pre-final integrated system prototype =

== Corpus building tools ==

The HaBiT system prototype is accessible at [http://corpora.fi.muni.cz/habit/].

=== Software ===

 * WebBootCaT for HaBiT, a search engine querying and document downloading tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine, and downloads those documents. This tool is an implementation of ''Baroni, Kilgarriff, Pomikálek, Rychlý. "WebBootCaT: instant domain-specific corpora to support human translators." In Proceedings of EAMT, pp. 247-252. 2006.''
 * !SpiderLing, a web spider for linguistics, is software for obtaining text from the web for building text corpora. Many documents on the web contain only material unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences; in fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore download a lot of data that is filtered out during post-processing, which makes web corpus collection inefficient. !SpiderLing addresses this by steering the crawl towards text-rich resources.
 * Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
 * Justext is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate content from boilerplate.
 * Onion is a tool for removing duplicate parts from large collections of texts. One instance of each duplicate part is kept; the others are marked or removed. The tool can remove both identical and similar documents, at any level (document, paragraph, or sentence, if present in the data). The de-duplication algorithm is based on comparing hashes of n-grams of words.
 * Unitok is a universal text tokeniser with specific settings for many languages. It turns plain text into a sequence of newline-separated tokens (vertical format), while preserving XML-like tags containing metadata.
 * [https://nlp.fi.muni.cz/projekty/habit/geez2sera/index.cgi geez2sera] is a tool for converting (transliterating) the [https://en.wikipedia.org/wiki/Ge%27ez_script Geez] script into [http://abyssiniagateway.net/fidel/sera-faq_1.html SERA], the System for Ethiopic Representation using Latin ASCII characters. It is based on the [https://github.com/fgaim/HornMorpho L3 Python library] (GPL licence).
 * [https://nlp.fi.muni.cz/projekty/habit/amtag/index.cgi AmTag] is a tagger for Amharic. It is [http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ TreeTagger] with an Amharic model.
 * [http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/ Language identification using character trigrams] (licensed under the [https://docs.python.org/3/license.html PSF licence])

=== Language resources ===

==== Wordlists, top 1000 items ====

 * [raw-attachment:Amharic_geez.wl.txt Amharic wordlist, Geez], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=amwac16 browse the Amharic wordlist online]
 * [raw-attachment:Amharic_sera.wl.txt Amharic wordlist, SERA], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=amwac16 browse the Amharic wordlist online]
 * [raw-attachment:Czech.wl.txt Czech wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=cztenten16_0 browse the Czech wordlist online]
 * [raw-attachment:Norwegian_Bokmal.wl.txt Norwegian Bokmål wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=notenten15_4_bokmal browse the Norwegian Bokmål wordlist online]
 * [raw-attachment:Norwegian_Nynorsk.wl.txt Norwegian Nynorsk wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=notenten15_4_nynorsk browse the Norwegian Nynorsk wordlist online]
 * [raw-attachment:Oromo.wl.txt Oromo wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=orwac16 browse the Oromo wordlist online]
 * [raw-attachment:Somali.wl.txt Somali wordlist], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=sowac16 browse the Somali wordlist online]
 * [raw-attachment:Tigrinya_geez.wl.txt Tigrinya wordlist, Geez], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=tiwac16 browse the Tigrinya wordlist online]
 * [raw-attachment:Tigrinya_sera.wl.txt Tigrinya wordlist, SERA], [http://corpora.fi.muni.cz/habit/run.cgi/wordlist_form?corpname=tiwac16 browse the Tigrinya wordlist online]

=== Language-dependent models and configuration files ===

 * Language samples for language identification using a character trigram model:
   * [raw-attachment:Amharic_sample.txt]
   * [raw-attachment:Arabic_sample.txt]
   * [raw-attachment:Czech_sample.txt]
   * [raw-attachment:Danish_sample.txt]
   * [raw-attachment:English_sample.txt]
   * [raw-attachment:Norwegian_Bokmal_sample.txt]
   * [raw-attachment:Norwegian_Nynorsk_sample.txt]
   * [raw-attachment:Oromo_sample.txt]
   * [raw-attachment:Slovak_sample.txt]
   * [raw-attachment:Somali_sample.txt]
   * [raw-attachment:Tigrinya_sample.txt]
 * Wordlists for language identification and boilerplate removal using Justext:
   * [raw-attachment:Amharic_justext.txt]
   * [raw-attachment:Arabic_justext.txt]
   * [raw-attachment:Czech_justext.txt]
   * [raw-attachment:Danish_justext.txt]
   * [raw-attachment:English_justext.txt]
   * [raw-attachment:Norwegian_Bokmal_justext.txt]
   * [raw-attachment:Norwegian_Nynorsk_justext.txt]
   * [raw-attachment:Oromo_justext.txt]
   * [raw-attachment:Slovak_justext.txt]
   * [raw-attachment:Somali_justext.txt]
   * [raw-attachment:Tigrinya_justext.txt]
 * Byte trigram models for character encoding detection using chared:
   * [raw-attachment:Arabic.edm]
   * [raw-attachment:Czech.edm]
   * [raw-attachment:Danish.edm]
   * [raw-attachment:English.edm]
   * [raw-attachment:Norwegian.edm]
   * [raw-attachment:Slovak.edm]
   * [raw-attachment:Universal_UTF8.edm] -- used for Amharic, Oromo, Somali, Tigrinya
 * Tokenisation rules for Unitok:
   * [raw-attachment:unitok_Czech.py]
   * [raw-attachment:unitok_Ethiopian.py] -- used for Amharic, Oromo, Somali, Tigrinya
   * [raw-attachment:unitok_Norwegian.py]
 * !SpiderLing crawler configuration:
   * [raw-attachment:czech_spiderling_config.py]
   * [raw-attachment:ethiopian_spiderling_config.py]
   * [raw-attachment:norwegian_spiderling_config.py]

Notes:
 * Arabic models were used for distinguishing Ethiopian languages from Arabic.
 * Danish models were used for distinguishing Norwegian from Danish.
 * English models were used for distinguishing all languages within the project from English.
 * Slovak models were used for distinguishing Czech from Slovak.
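
To make the character trigram approach concrete, here is a minimal sketch of language identification in the spirit of the linked recipe (not its actual code): each language sample yields a trigram frequency profile, and unknown text is assigned the language with the most similar profile. The toy samples below merely stand in for the `*_sample.txt` attachments listed above.

```python
from collections import Counter

def trigram_profile(text):
    """Frequencies of overlapping character trigrams, with word-boundary padding."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram profiles."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / ((norm(a) * norm(b)) or 1.0)

def identify(text, profiles):
    """Assign the language whose sample profile is most similar to the text."""
    p = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(p, profiles[lang]))

# Toy samples standing in for the *_sample.txt files above.
profiles = {
    "English": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "Czech": trigram_profile("příliš žluťoučký kůň úpěl ďábelské ódy"),
}
print(identify("a lazy dog sleeps", profiles))  # prints "English"
```

With realistically sized samples, such profiles are also what makes the pairwise distinctions in the notes above possible (e.g. Czech vs. Slovak).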
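
The byte trigram models used by chared can be illustrated with a small sketch (a simplified illustration of the principle, not chared's actual API or `.edm` model format): a plain-text sample of the language is encoded in each candidate encoding to build one byte-trigram profile per encoding, and an input is assigned the encoding whose profile it matches best.

```python
from collections import Counter

def byte_trigrams(data):
    """Frequencies of overlapping byte trigrams in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))

def train(sample_text, encodings):
    """One byte-trigram profile per candidate encoding, built from a
    plain-text sample of the language (the role of the .edm models above)."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}

def detect_encoding(data, model):
    """Pick the encoding whose profile shares the most trigram mass with the input."""
    trigrams = byte_trigrams(data)
    def score(enc):
        return sum(min(count, model[enc][tri]) for tri, count in trigrams.items())
    return max(model, key=score)

# Toy Czech sample; the real models are trained on much more text.
model = train("žluťoučký kůň úpěl ďábelské ódy", ["utf-8", "cp1250", "iso-8859-2"])
print(detect_encoding("kůň žluťoučký".encode("utf-8"), model))  # prints "utf-8"
```

This also shows why knowing the language helps: the candidate set and the trained profiles are both language-specific, so encodings that are byte-compatible for other languages can still be told apart.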
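
The kind of heuristic Justext applies to each paragraph can be sketched roughly as follows. The thresholds and the two-way good/boilerplate split here are invented for illustration; the real algorithm is context-sensitive, uses more classes, and relies on the per-language `*_justext.txt` stopword lists above.

```python
import re

# A toy stopword list standing in for the *_justext.txt wordlists above.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "it", "to", "for"}

def classify(text, link_chars=0, stopwords=STOPWORDS,
             min_length=60, min_stopword_density=0.3, max_link_density=0.5):
    """Classify one paragraph: link-heavy, short, or stopword-poor
    paragraphs are unlikely to consist of full sentences."""
    words = re.findall(r"\w+", text.lower())
    if not words or link_chars / max(len(text), 1) > max_link_density:
        return "boilerplate"
    stopword_density = sum(w in stopwords for w in words) / len(words)
    if len(text) >= min_length and stopword_density >= min_stopword_density:
        return "good"
    return "boilerplate"

print(classify("The tool is designed to preserve the main text of a page "
               "and it is therefore well suited for building corpora."))  # good
print(classify("Home | Products | Contact", link_chars=20))  # boilerplate
```

The intuition is that running text in any language is dense in function words, which is why a plain wordlist per language is enough to drive the classifier.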
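
The de-duplication idea behind Onion, comparing hashes of word n-grams, can be sketched as follows (a toy illustration; Onion's actual implementation, n-gram size, and thresholds differ). A paragraph counts as a duplicate when most of its n-gram hashes have already been seen.

```python
def ngram_hashes(text, n=5):
    """Hashes of overlapping word n-grams; shorter texts are hashed whole."""
    words = text.split()
    if len(words) < n:
        return {hash(tuple(words))}
    return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}

def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep the first occurrence; drop later paragraphs whose n-gram
    hashes mostly (>= threshold) overlap with hashes already seen."""
    seen, kept = set(), []
    for p in paragraphs:
        hashes = ngram_hashes(p, n)
        if len(hashes & seen) / len(hashes) < threshold:
            kept.append(p)
        seen |= hashes
    return kept

docs = [
    "one instance of duplicate parts is kept others are marked or removed",
    "a completely different paragraph about something else entirely here",
    "one instance of duplicate parts is kept others are marked or removed",
]
print(len(deduplicate(docs)))  # prints 2
```

Because overlap is measured on n-grams rather than whole texts, near-duplicates that differ in a few words are caught as well as exact copies.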
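
The vertical format produced by Unitok, one token per line with XML-like metadata tags preserved, can be illustrated with a toy tokeniser. The real Unitok rules (the `unitok_*.py` files above) are language-specific and far richer; this only shows the output shape.

```python
import re

# Tags like <doc ...> and </doc> are kept intact as single tokens;
# words and punctuation each go on their own line.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

def to_vertical(text):
    """Turn plain text into newline-separated tokens (vertical format)."""
    return "\n".join(TOKEN_RE.findall(text))

print(to_vertical('<doc id="1">Hello, world!</doc>'))
```

The alternation order in the pattern matters: tags are matched before word characters, so a `<doc>` tag is never split into punctuation and words.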