[[BR]]Another important issue is duplicity. Digital information on the web is easily copied, so documents may overlap significantly. This happens, for example, when multiple online newspapers publish the same announcement released by a press agency, when documents are revised, or when earlier posts are quoted in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential: it avoids duplicate concordance lines and prevents statistical processing of corpus data from being biased by artificially inflated frequencies of some words and expressions. We remove every paragraph in which more than 50 % of the word 7-tuples have already been encountered in previously processed data.
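The paragraph filter described above can be sketched as follows. This is an illustrative minimal version (the function names and the shingle bookkeeping are mine), not the actual implementation used in the pipeline:

```python
def word_7tuples(paragraph):
    """Yield all overlapping 7-word tuples (shingles) of a paragraph."""
    words = paragraph.split()
    for i in range(len(words) - 6):
        yield tuple(words[i:i + 7])

def is_duplicate(paragraph, seen, threshold=0.5):
    """Flag a paragraph as a near-duplicate when more than `threshold`
    of its word 7-tuples occur in previously processed text.

    `seen` is a set of 7-tuples accumulated over the whole corpus;
    it is updated as a side effect so later paragraphs see this one."""
    shingles = list(word_7tuples(paragraph))
    if not shingles:
        return False  # too short to judge; keep it
    known = sum(1 for s in shingles if s in seen)
    duplicate = known / len(shingles) > threshold
    seen.update(shingles)
    return duplicate
```

A paragraph seen for the second time is rejected, since all of its 7-tuples are already in the accumulated set.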
== HaBiT corpus creation pipeline ==
To summarise, the data retrieval software and text cleaning tools used in the corpus creation pipeline in this project are:
* WebBootCaT, a search engine querying and corpus building tool.
* SpiderLing, a web crawler for linguistic purposes.
* Character encoding detection using a byte trigram model.
* Language identification using a wordlist and a character trigram model.
* jusText, a boilerplate removal tool using a heuristic algorithm.
* Onion, a de-duplication tool that removes both identical and similar documents at the paragraph level.
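To illustrate the character trigram approach mentioned in the list above, a minimal language identifier can score a text against per-language trigram frequency models. The training texts and the simple frequency-overlap score below are simplified stand-ins, not the pipeline's actual models:

```python
from collections import Counter

def char_trigrams(text):
    """Count all overlapping character trigrams in a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(sample):
    """Build a relative-frequency trigram model from sample text."""
    counts = char_trigrams(sample)
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def identify(text, models):
    """Return the language whose model best matches the text's trigrams.
    The score sums model frequencies of the text's trigrams, a crude
    stand-in for the log-likelihood scoring a real identifier would use."""
    grams = char_trigrams(text)
    def score(model):
        return sum(n * model.get(tri, 0.0) for tri, n in grams.items())
    return max(models, key=lambda lang: score(models[lang]))
```

For example, training toy models on one English and one Czech sentence and then calling `identify("the fox and the dog", models)` picks the English model, because the Czech model shares essentially no trigrams with the input.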