Changes between Version 1 and Version 2 of SystemDesign

Timestamp:: Jan 15, 2016, 9:48:33 AM (9 years ago)
Author:: hales

 = D1.1.1 System specifications: Overall system design definitions =
+The HaBiT system is composed of mostly independent modules. Different tasks can be solved by combining several tools. Most of the tools have two interfaces:
+ 1. Unix command line -- a Unix command (or script) reads data from standard input and writes results to standard output;
+ 2. web API -- the module runs as a web service using the standard HTTP protocol and JSON format.
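As a minimal sketch of how one module might expose both interfaces over a single core function, consider the following. The function `count_tokens`, the tab-separated output, and the `{"text": ...}` request shape are all hypothetical, chosen only for illustration; they do not describe any actual HaBiT module.

```python
import io
import json


def count_tokens(text: str) -> dict:
    """Hypothetical core function: count whitespace-separated tokens and types."""
    tokens = text.split()
    return {"tokens": len(tokens), "types": len(set(tokens))}


def run_cli(stdin, stdout) -> None:
    """Command-line interface: read text from stdin, write the result to stdout."""
    result = count_tokens(stdin.read())
    stdout.write(f"{result['tokens']}\t{result['types']}\n")


def handle_json(body: bytes) -> bytes:
    """Web API interface: JSON request body in, JSON response body out.

    The request shape {"text": "..."} is an assumption made for this sketch.
    """
    request = json.loads(body.decode("utf-8"))
    return json.dumps(count_tokens(request["text"])).encode("utf-8")
```

In practice `run_cli` would be called with `sys.stdin` and `sys.stdout`, and `handle_json` would sit behind whatever HTTP server the module uses. Keeping the core function separate means both interfaces return the same results.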
+Each module can be developed and tested independently or with only a small number of other modules.
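Because the command-line tools read standard input and write standard output, combining them amounts to chaining filters. The sketch below illustrates the idea with a hypothetical helper, `run_pipeline`. The two stages are stand-in Python one-liners, not real HaBiT commands, whose names and options are not assumed here.

```python
import subprocess
import sys


def run_pipeline(stages, data: bytes) -> bytes:
    """Pass data through each command in turn: stdout of one is stdin of the next."""
    for argv in stages:
        completed = subprocess.run(
            argv, input=data, stdout=subprocess.PIPE, check=True
        )
        data = completed.stdout
    return data


# Stand-in stages (NOT real HaBiT tools): tiny filters run by the Python interpreter.
LOWERCASE = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().lower())",
]
ONE_WORD_PER_LINE = [
    sys.executable, "-c",
    "import sys; print('\\n'.join(sys.stdin.read().split()))",
]
```

Calling `run_pipeline([LOWERCASE, ONE_WORD_PER_LINE], b"Hello World")` runs both filters in sequence. Note that this helper buffers each stage's full output in memory, whereas a shell pipe streams data between processes. For large corpora a streaming pipe would likely be the better choice.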
+The modules fall into three groups: corpus building, corpus searching, and corpus exploitation.
+Corpus building modules:
+ * **Spiderling** -- a web spider for linguistics. It can crawl text-rich parts of the web and collect a lot of data suitable for text corpora.
+ * **Chared** -- a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
+ * **czaccent** -- a tool for restoring diacritics in Czech texts written without them.
+ * **!JusText** -- an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
+ * **Unitok** -- a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
+ * **Onion** (ONe Instance ONly) -- a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
+ * PoS taggers -- automatic part-of-speech annotation
+ * **compilecorp** -- a corpus indexing tool. It creates supplemental data like indexes and frequency tables for faster corpus querying and exploitation.
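Two ideas from the list above can be illustrated with a toy sketch: the "vertical" format (one token per line, with XML-like tags preserved) and threshold-based de-duplication. This is not the implementation used by Unitok or Onion. The tokenisation regex, the word 3-gram comparison, the Jaccard similarity measure, and the default threshold of 0.5 are all assumptions made for the example.

```python
import re

# A token is an XML-like tag, a run of word characters,
# or a single other non-space character.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def to_vertical(text: str) -> str:
    """Return one token per line, keeping XML-like tags intact."""
    return "\n".join(TOKEN_RE.findall(text))


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams used to compare two paragraphs."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def drop_duplicates(paragraphs, threshold: float = 0.5):
    """Keep a paragraph only if its similarity to every kept one is below threshold."""
    kept, seen = [], []
    for paragraph in paragraphs:
        grams = shingles(paragraph)
        # Jaccard similarity: shared n-grams divided by all n-grams of the pair.
        if all(len(grams & old) / len(grams | old) < threshold for old in seen):
            kept.append(paragraph)
            seen.append(grams)
    return kept
```

Comparing every new paragraph against every kept paragraph is quadratic, so this sketch only suits small inputs. A de-duplicator intended for large collections would need a more scalable data structure.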
+Corpus searching modules:
+ * Sketch Engine
+ * freqs
+ * lscngr
+Corpus exploitation modules:
+ * word sense disambiguation
+ * keyword extraction