Changes between Version 2 and Version 3 of CorporaAndCorpusBuilding


Timestamp: Jan 19, 2016, 8:18:04 PM
Author: xsuchom2
Comment: --

Legend:

  unmodified: present in both versions
  v2: removed in this change
  v3: added in this change
  modified: changed between versions (both forms shown)
  • CorporaAndCorpusBuilding

Line 71 (modified):

v2: * The list was manually filtered by native speakers (TODO acknowledge).
v3: * The list was manually filtered by native speakers. (Thanks to Derib Ado and Feda Negesse from Addis Ababa University.)
Line 86 (unmodified):

[[BR]]A boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements, and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language, and the corpus search provides no useful evidence about the phenomenon being investigated.
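A minimal sketch of this cleaning step, using the jusText Python library (the boilerplate removal tool listed later on this page); the URL is a placeholder, not part of the HaBiT pipeline:

{{{#!python
# Strip boilerplate paragraphs with the jusText library.
# jusText classifies each paragraph as boilerplate or content using
# heuristics such as link density, stopword density and length.
import urllib.request

import justext

html = urllib.request.urlopen("http://example.com/article.html").read()
paragraphs = justext.justext(html, justext.get_stoplist("English"))

# Keep only paragraphs classified as content.
clean_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(clean_text)
}}}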
Line 88 (modified; the escaped "50\%" of v2 becomes "50 %" in v3):

[[BR]]Another important issue is duplicity. Digital information on the web is easily copied, so documents may overlap significantly. This can happen when multiple online newspapers share the same announcement released by a press agency, when documents are revised, or when previous posts are quoted in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential, e.g. to avoid duplicate concordance lines and to prevent biased results from statistical processing of corpus data with artificially inflated frequencies of some words and expressions. We remove paragraphs in which more than 50 % of word 7-tuples were encountered in previously processed data.
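A minimal sketch of the criterion just described: a paragraph is dropped when more than 50 % of its word 7-tuples were already seen in previously processed data. This is a simplified illustration of what Onion does at the paragraph level, not its actual implementation; all names below are invented:

{{{#!python
# Drop a paragraph when more than half of its word 7-tuples
# were seen in previously kept paragraphs (simplified Onion-style check).
N = 7            # length of word tuples (shingles)
THRESHOLD = 0.5  # maximal allowed share of previously seen tuples

seen = set()     # 7-tuples collected from paragraphs kept so far

def is_duplicate(paragraph):
    words = paragraph.split()
    shingles = [tuple(words[i:i + N]) for i in range(len(words) - N + 1)]
    if not shingles:
        return False  # too short to judge; keep it
    ratio = sum(s in seen for s in shingles) / len(shingles)
    if ratio > THRESHOLD:
        return True
    seen.update(shingles)  # remember tuples of kept paragraphs only
    return False

paragraphs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
]
print([p for p in paragraphs if not is_duplicate(p)])  # second one is dropped
}}}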
Lines 90-108 (v2, removed):

[[BR]]To summarise, the data cleaning issues dealt with in the corpus creation pipeline in this project are:

 * character encoding detection using a byte trigram model,
 * language identification using a wordlist and a character trigram model,
 * boilerplate removal using the heuristic tool Justext,
 * de-duplication of both identical and similar documents, at the paragraph level, using the tool Onion,
 * malformed data (e.g. misspelled words) is not dealt with.

The models used in the process were trained on subsets of documents found by the search engine in the target languages.

Lines 91-100 (v3, added):

== HaBiT corpus creation pipeline ==
To summarise, the data retrieval software and text cleaning tools used in the corpus creation pipeline in this project are:

 * WebBootCaT, a search engine querying and corpus building tool.
 * SpiderLing, a web crawler for linguistic purposes.
 * Character encoding detection using a byte trigram model.
 * Language identification using a wordlist and a character trigram model.
 * Justext, a boilerplate removal tool using a heuristic algorithm.
 * Onion, a de-duplication tool removing both identical and similar documents at the paragraph level.

Notes: The models used in the process were trained on subsets of documents found by the search engine in the target languages. Malformed data (e.g. misspelled words) is not dealt with.
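The trigram models mentioned above can be illustrated with a short sketch: a model is trained on sample documents per language, and a new document is assigned the language whose trigram distribution fits it best (encoding detection works analogously over byte trigrams). The training snippets and function names below are invented for illustration and are not the pipeline's actual models:

{{{#!python
# Language identification with character trigram models (sketch).
import math
from collections import Counter

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(samples):
    # Relative frequency of each character trigram in the training samples.
    counts = Counter(t for s in samples for t in trigrams(s))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def score(model, text, floor=1e-7):
    # Log-probability of the text under the model; unseen trigrams
    # get a small floor probability (crude smoothing).
    return sum(math.log(model.get(t, floor)) for t in trigrams(text))

models = {
    "english": train(["the quick brown fox", "this is an example text"]),
    "czech": train(["příliš žluťoučký kůň", "toto je ukázkový text"]),
}

doc = "an example of a text"
print(max(models, key=lambda lang: score(models[lang], doc)))  # english
}}}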