D1.1.1 System specifications: Overall system design definitions

A newer version is available as attachment:del_1.1.1_v2.pdf:wiki:InterimResults

The HaBiT system is composed of mostly independent modules. Different tasks can be solved by combining several tools. Most of the tools have two interfaces:

1) Unix command line -- a Unix command (or script) reads data from standard input and writes results to standard output;
2) web API -- the module runs as a web service using the standard HTTP protocol and the JSON format.
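The two interfaces can share one processing function. The following is a minimal Python sketch of that idea, not the code of any actual HaBiT module: the `process` function, the `{"text": ...}` / `{"result": ...}` JSON fields and the handler class are all assumptions made for illustration.

```python
import io
import json
from http.server import BaseHTTPRequestHandler


def process(text: str) -> str:
    """Placeholder for a module's real work (assumption: just lowercases)."""
    return text.lower()


def run_cli(stdin, stdout) -> None:
    """Command-line interface: read everything from stdin, write to stdout."""
    stdout.write(process(stdin.read()))


def handle_json(body: bytes) -> bytes:
    """Web API core: decode a JSON request, run the module, encode the reply.

    The "text" and "result" field names are hypothetical.
    """
    request = json.loads(body.decode("utf-8"))
    reply = {"result": process(request["text"])}
    return json.dumps(reply, ensure_ascii=False).encode("utf-8")


class ModuleHandler(BaseHTTPRequestHandler):
    """HTTP wrapper around handle_json; answers POST requests with JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# Demonstration with in-memory streams instead of the real stdin/stdout.
demo_out = io.StringIO()
run_cli(io.StringIO("Some INPUT text\n"), demo_out)
print(demo_out.getvalue(), end="")
print(handle_json(b'{"text": "Some INPUT text"}').decode("utf-8"))
```

In a real script, `run_cli(sys.stdin, sys.stdout)` would serve the command-line interface, and `http.server.HTTPServer((host, port), ModuleHandler).serve_forever()` would serve the web API. Keeping the work in one function means both interfaces return the same results.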

Each module can be developed and tested independently, or together with only a small number of other modules.

There are three groups of modules: corpus building, corpus searching, and corpus exploitation.

Corpus building modules:

  • Spiderling -- a web spider for linguistics. It can crawl text-rich parts of the web and collect a lot of data suitable for text corpora.
  • Chared -- a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
  • czaccent -- a tool for adding diacritics to Czech texts written without them.
  • JusText -- an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
  • Unitok -- a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
  • Onion (ONe Instance ONly) -- a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
  • PoS taggers -- tools for automatic part-of-speech annotation.
  • compilecorp -- a corpus indexing tool. It creates supplemental data such as indexes and frequency tables for faster corpus querying and exploitation.
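To make two of the steps above more concrete, here is a toy Python sketch of the "vertical" format and of threshold-based de-duplication. It does not reproduce the Unitok or Onion implementations. The regular expressions, the function names, the n-gram overlap measure and the default threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical, very simplified token pattern: an XML-like tag, a word,
# or a single punctuation character.
TAG = re.compile(r"<[^>]+>")
TOKEN = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def to_vertical(text: str) -> str:
    """Emit one token per line; XML-like tags are kept as single tokens."""
    return "\n".join(TOKEN.findall(text))


def ngrams(tokens, n=3):
    """Return the set of n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, threshold=0.5, n=3):
    """Keep a paragraph only if at most `threshold` of its n-grams were seen
    in previously kept paragraphs."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Tags carry metadata, so they are ignored when measuring similarity.
        tokens = [t for t in TOKEN.findall(paragraph) if not TAG.fullmatch(t)]
        grams = ngrams(tokens, n)
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap <= threshold:
            kept.append(paragraph)
            seen |= grams
    return kept


print(to_vertical('<doc id="1">Hello, world!</doc>'))
print(deduplicate([
    "The cat sat on the mat.",
    "The cat sat on the mat.",
    "A dog barked loudly at night.",
]))
```

The first call prints the tag, each word and each punctuation mark on separate lines. The second keeps the first and third paragraphs and drops the repeated one. Lowering `threshold` makes the filter stricter, raising it more permissive.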

Corpus searching modules:

  • Sketch Engine
  • freqs
  • lscngr

Corpus exploitation modules:

  • word sense disambiguation
  • keyword extraction

Last modified on Jan 17, 2017, 9:40:22 AM