[[BR]]Another important issue is duplicity. Digital information on the web is easily copied, so documents may overlap significantly. This happens, for example, when multiple online newspapers publish the same announcement released by a press agency, when documents are revised, or when earlier posts are quoted in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential: it avoids duplicate concordance lines and prevents statistical processing of corpus data from being biased by artificially inflated frequencies of some words and expressions. We remove every paragraph in which more than 50 % of the word 7-tuples have already been encountered in previously processed data.
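The paragraph filter described above can be sketched as follows. This is an illustrative minimal version (the function names and the shingle bookkeeping are mine), not the actual implementation used in the pipeline:

```python
def word_7tuples(paragraph):
    """Yield all overlapping 7-word tuples (shingles) of a paragraph."""
    words = paragraph.split()
    for i in range(len(words) - 6):
        yield tuple(words[i:i + 7])

def is_duplicate(paragraph, seen, threshold=0.5):
    """Flag a paragraph as a near-duplicate when more than `threshold`
    of its word 7-tuples occur in previously processed text.

    `seen` is a set of 7-tuples accumulated over the whole corpus;
    it is updated as a side effect so later paragraphs see this one."""
    shingles = list(word_7tuples(paragraph))
    if not shingles:
        return False  # too short to judge; keep it
    known = sum(1 for s in shingles if s in seen)
    duplicate = known / len(shingles) > threshold
    seen.update(shingles)
    return duplicate
```

A paragraph seen for the second time is rejected, since all of its 7-tuples are already in the accumulated set.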
== HaBiT corpus creation pipeline ==
To summarise, the data retrieval software and text cleaning tools used in the corpus creation pipeline in this project are:
* WebBootCaT, a search engine querying and corpus building tool.
* SpiderLing, a web crawler for linguistic purposes.
* Character encoding detection using a byte trigram model.
* Language identification using a wordlist and a character trigram model.
* jusText, a boilerplate removal tool using a heuristic algorithm.
* Onion, a de-duplication tool that removes both identical and similar documents at the paragraph level.
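To illustrate the character trigram approach mentioned in the list above, a minimal language identifier can score a text against per-language trigram frequency models. The training texts and the simple frequency-overlap score below are simplified stand-ins, not the pipeline's actual models:

```python
from collections import Counter

def char_trigrams(text):
    """Count all overlapping character trigrams in a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(sample):
    """Build a relative-frequency trigram model from sample text."""
    counts = char_trigrams(sample)
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def identify(text, models):
    """Return the language whose model best matches the text's trigrams.
    The score sums model frequencies of the text's trigrams, a crude
    stand-in for the log-likelihood scoring a real identifier would use."""
    grams = char_trigrams(text)
    def score(model):
        return sum(n * model.get(tri, 0.0) for tri, n in grams.items())
    return max(models, key=lambda lang: score(models[lang]))
```

For example, training toy models on one English and one Czech sentence and then calling `identify("the fox and the dog", models)` picks the English model, because the Czech model shares essentially no trigrams with the input.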