Changes between Version 1 and Version 2 of SystemDesign


Timestamp:
Jan 15, 2016, 9:48:33 AM (8 years ago)
Author:
hales
Comment:

--

  • SystemDesign

= D1.1.1 System specifications: Overall system design definitions =

The HaBiT system is composed of mostly independent modules. Different tasks can be solved by combining several tools. Most of the tools have two interfaces:
 1) Unix command line -- a Unix command (or script) reads data from standard input and writes results to standard output;
 2) web API -- the module runs as a web service using the standard HTTP protocol and the JSON format.
Each module can be developed and tested independently or together with only a small number of other modules.
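
The two interfaces described above can be sketched as follows. This is only a minimal illustration, not code from any HaBiT module: the `process` function, the `serve` argument, the port, and the JSON field names `text` and `result` are all assumptions made for the example.

```python
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


def process(text: str) -> str:
    """Hypothetical module logic: lowercasing stands in for real processing."""
    return text.lower()


def run_cli(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Interface 1: read standard input, write results to standard output."""
    stdout.write(process(stdin.read()))


class ModuleHandler(BaseHTTPRequestHandler):
    """Interface 2: accept a JSON request over HTTP and reply with JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # The "text" and "result" field names are assumptions of this sketch.
        body = json.dumps({"result": process(request.get("text", ""))}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "serve":
        # Web API mode; address and port are arbitrary choices.
        HTTPServer(("127.0.0.1", 8000), ModuleHandler).serve_forever()
    else:
        # Command-line mode: usable as a filter in a shell pipeline.
        run_cli()
```

Keeping the processing in one function and wrapping it twice means both interfaces always behave the same, and the module can be tested without starting a server.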

There are three groups of modules: corpus building, corpus searching, and corpus exploitation.

Corpus building modules:
 * **Spiderling** -- a web spider for linguistics. It can crawl text-rich parts of the web and collect large amounts of data suitable for text corpora.
 * **Chared** -- a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
 * **czaccent** -- a tool for restoring diacritics in Czech texts written without them.
 * **!JusText** -- an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
 * **Unitok** -- a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
 * **Onion** (ONe Instance ONly) -- a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicates based on the threshold you set.
 * PoS taggers -- tools for automatic part-of-speech annotation.
 * **compilecorp** -- a corpus indexing tool. It creates supplemental data such as indexes and frequency tables for faster corpus querying and exploitation.
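
To make two of the steps above more concrete, the sketch below shows a toy conversion to the “vertical” format and a toy threshold-based de-duplication. These are simplified stand-ins written for illustration, not the actual Unitok or Onion algorithms: the regular expression, the n-gram overlap measure, and the names `to_vertical`, `shingles`, and `deduplicate` are all assumptions.

```python
import re

# Toy stand-ins for two pipeline stages; NOT the real Unitok or Onion code.
# A token is an XML-like tag, a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def to_vertical(text: str) -> str:
    """Emit one token per line, keeping XML-like tags as single tokens."""
    return "\n".join(TOKEN_RE.findall(text))


def shingles(tokens, n=3):
    """Return the set of n-token windows (n-grams) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def deduplicate(paragraphs, threshold=0.5, n=3):
    """Drop a paragraph when at least `threshold` of its n-grams were seen before."""
    seen, kept = set(), []
    for par in paragraphs:
        grams = shingles(par.split(), n)
        overlap = len(grams & seen) / len(grams)
        if overlap < threshold:
            kept.append(par)
        seen |= grams
    return kept


if __name__ == "__main__":
    print(to_vertical('<doc id="1">Hello, world!</doc>'))
    print(deduplicate(["a b c d e", "a b c d e", "x y z w v"]))
```

Lowering `threshold` makes the filter stricter, because a paragraph is dropped as soon as a smaller share of its n-grams has already occurred.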

Corpus searching modules:
 * Sketch Engine
 * freqs
 * lscngr

Corpus exploitation modules:
 * word sense disambiguation
 * keyword extraction