Changes between Version 7 and Version 8 of HabitSystemV3


Timestamp:
May 31, 2017, 6:05:30 PM
Author:
xsuchom2
Comment:

WebBootCaT for HaBiT

  • HabitSystemV3

    v7 v8  
    6 6
    7 7 === Software ===
    8   * WebBootCaT for HaBiT, a search engine querying and document downloading tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine, and downloads the documents.
      8 * WebBootCaT for HaBiT, a search engine querying and document downloading tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine, and downloads the documents. This tool is an implementation of ''Baroni, Kilgarriff, Pomikálek, Rychlý. "WebBootCaT: instant domain-specific corpora to support human translators." In Proceedings of EAMT, pp. 247-252. 2006.''
    9 9 * !SpiderLing, a web spider for linguistics, is software for obtaining text from the web that is useful for building text corpora. Many documents on the web contain only material that is unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not made up of full sentences. In fact, such pages represent the vast majority of the web. Therefore, unrestricted web crawls typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient.
    10 10 * Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
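
The query, collect, and download flow described for WebBootCaT above can be sketched roughly as follows. This is only an illustrative sketch, not the actual WebBootCaT for HaBiT code: the function names `build_queries`, `collect_urls`, and `download` are hypothetical, combining seed words into small tuples is an assumption about how queries are formed, and the search engine is passed in as a plain callable because no specific search API is named here.

```python
import itertools
import random
import urllib.request
from typing import Callable, Dict, Iterable, List


def build_queries(seeds: List[str], tuple_size: int = 3,
                  max_queries: int = 10, rng_seed: int = 0) -> List[str]:
    """Combine user-specified seed words into multi-word queries.

    Assumption: queries are tuples of distinct seeds, picked in a
    reproducible pseudo-random order.
    """
    size = min(tuple_size, len(seeds))
    combos = list(itertools.combinations(seeds, size))
    random.Random(rng_seed).shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]


def collect_urls(queries: Iterable[str],
                 search: Callable[[str], Iterable[str]]) -> List[str]:
    """Run every query through `search` and return unique URLs in first-seen order."""
    seen = set()
    urls = []
    for query in queries:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def fetch_with_urllib(url: str, timeout: float = 10.0) -> bytes:
    """Download one document using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download(urls: Iterable[str],
             fetch: Callable[[str], bytes] = fetch_with_urllib) -> Dict[str, bytes]:
    """Fetch each URL and map it to the raw document bytes."""
    return {url: fetch(url) for url in urls}


if __name__ == "__main__":
    # Hypothetical stand-in for a real search engine client (no network access).
    def fake_search(query: str) -> List[str]:
        return ["http://example.org/" + word for word in query.split()]

    queries = build_queries(["corpus", "crawler", "encoding", "tokenizer"], tuple_size=2)
    urls = collect_urls(queries, fake_search)
    docs = download(urls, fetch=lambda url: b"placeholder document")
    print(len(queries), "queries,", len(docs), "documents")
```

Passing `search` and `fetch` in as callables keeps the sketch independent of any particular search engine or HTTP client; a real deployment would substitute its own implementations.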