| 2 | |
| 3 | The HaBiT system is composed of mostly independent modules. Different tasks can be solved by combining several tools. Most of the tools have two interfaces: |
| 4 | 1) Unix command line -- a Unix command (or script) reads data from standard input and writes results to standard output; |
| 5 | 2) web API -- the module runs as a web service using the standard HTTP protocol and the JSON format. |
| 6 | Each module can be developed and tested independently or together with only a small number of other modules. |
| 7 | |
| 8 | The modules fall into three groups: corpus building, corpus searching, and corpus exploitation. |
| 9 | |
| 10 | Corpus building modules: |
| 11 | * **Spiderling** -- a web spider for linguistics. It can crawl text-rich parts of the web and collect large amounts of data suitable for text corpora. |
| 12 | * **Chared** -- a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages. |
| 13 | * **czaccent** -- a tool for restoring diacritics in Czech texts written without them. |
| 14 | * **!JusText** -- an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences. |
| 15 | * **Unitok** -- a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata. |
| 16 | * **Onion** (ONe Instance ONly) -- a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set. |
| 17 | * PoS taggers -- tools for automatic part-of-speech annotation. |
| 18 | * **compilecorp** -- a corpus indexing tool. It creates supplemental data like indexes and frequency tables for faster corpus querying and exploitation. |
| 19 | |
| 20 | Corpus searching modules: |
| 21 | * Sketch Engine |
| 22 | * freqs |
| 23 | * lscngr |
| 24 | |
| 25 | Corpus exploitation modules: |
| 26 | * word sense disambiguation |
| 27 | * keyword extraction |
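
The two module interfaces described at the top of this page (Unix command line and web API) can be illustrated with a minimal sketch. It is not taken from any HaBiT module: the `process` function, the `--serve` flag, the port, and the JSON field names `text` and `tokens` are all assumptions chosen for illustration. The point is only that a single piece of module logic can be exposed both through standard input/output and through HTTP with JSON.

```python
import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


def process(text):
    """Hypothetical module logic: split text on whitespace into tokens."""
    return text.split()


def run_cli(stdin=sys.stdin, stdout=sys.stdout):
    """Interface 1: read text from standard input, write one token per line."""
    for token in process(stdin.read()):
        stdout.write(token + "\n")


class ApiHandler(BaseHTTPRequestHandler):
    """Interface 2: POST a JSON object {"text": ...}, receive {"tokens": [...]}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = {"tokens": process(payload["text"])}
            status = 200
        except (ValueError, KeyError, TypeError, AttributeError):
            # Malformed JSON, a missing "text" field, or a non-string value.
            result = {"error": "expected a JSON object with a string 'text' field"}
            status = 400
        body = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(argv):
    # The "--serve" flag and port 8000 are assumptions made for this sketch.
    if len(argv) > 1 and argv[1] == "--serve":
        HTTPServer(("127.0.0.1", 8000), ApiHandler).serve_forever()
    else:
        run_cli()


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as `module.py`, the sketch would be used as a filter with `echo "some text" | python module.py`, or started as a service with `python module.py --serve` and queried by sending a POST request with a JSON body to `http://127.0.0.1:8000/`. Keeping the logic in one function and wrapping it twice is what lets a module be tested on its own from the command line before it is combined with others.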