To build a large collection of texts from the web, one needs to master the following general steps:
* Identify suitable documents to obtain. In the case of this project, all documents containing text in the target languages are suitable.
* Download the selected data from the internet, keeping important metadata such as the source URL and the date of acquisition.
* Process the obtained data by stripping off non-textual parts, clearing away boilerplate and other unwanted text, removing duplicate parts, and applying other methods that improve the quality of the resulting data.
* Store the result in a way that enables access according to the desired purpose, reusability and the ability to process the raw internet data again. The usual way to store and access a corpus is to employ a corpus manager.

= Obtaining Text Data from the Web =
There are billions of documents available on the web. The process of traversing the web and downloading data (crawling) is a time and resource consuming task. A web crawler is a piece of software made for the task of crawling the internet. The crawler is usually initialised by a set of starting internet points, the seed URLs. It downloads each document from the initial set, extracts links to other documents from the data and continues its work with the discovered set of new URLs.
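
The following sketch illustrates the basic crawl loop described above: a frontier seeded with start URLs, from which documents are downloaded and new links extracted. It is a simplified, single-threaded illustration, not the algorithm actually used by !SpiderLing; a production crawler adds scheduling, politeness delays, parallel downloading and robust error handling.

{{{#!python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed_urls, max_docs=100):
    """Breadth-first crawl: download, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    documents = []
    while frontier and len(documents) < max_docs:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
        except OSError:
            continue  # skip unreachable documents
        documents.append((url, html))
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            # keep only HTTP(S) links that have not been seen before
            if absolute.startswith(('http://', 'https://')) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return documents
}}}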

Since this is a one-time web crawl, a URL that is discovered again is considered a duplicate and discarded. Text data is collected once in this project; repeated crawls to keep multiple versions of the same web pages are not required. A "snapshot" of a part of the web in the target languages is created.

The crawling strategy – deciding which parts of the web to explore first, i.e. which documents to download immediately and which to postpone for later – is a very important factor in the design of a successful crawler. The traversing algorithm implemented is crucial for achieving wide coverage of web domains, higher crawling efficiency, a larger amount of extracted text or the capture of "important" web pages, whichever is the priority. !SpiderLing, the crawler enhanced and used within this project, focuses on downloading documents from web domains yielding a lot of text in the target languages.

Additional issues have to be taken into account when crawling the web: not overloading the source servers (by obeying the Robots exclusion protocol), boilerplate removal and content de-duplication (if desired), and robust post-processing of the crawled data (e.g. dealing with malformed data, language detection, character encoding detection).
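
Obeying the Robots exclusion protocol, mentioned above, can be handled with the Python standard library as in the sketch below. The user agent string is a placeholder, not the identifier actually used by the project's crawler.

{{{#!python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent='HaBiTCrawler'):
    """Check whether the site's robots.txt permits fetching the URL."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, '/robots.txt', '', ''))
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        return True  # robots.txt not reachable: assume crawling is allowed
    return parser.can_fetch(user_agent, url)

# Example: skip a document if the site forbids crawling it.
# if allowed_by_robots('http://am.wikipedia.org/wiki/some_page'):
#     download(...)
}}}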

= Target Languages =
In the case of the African languages, the crawler was constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in the seed URLs – were allowed to improve the recall.
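
A minimal sketch of such a domain constraint is given below. The set of allowed top level domains mirrors the list above; the helper name is illustrative.

{{{#!python
from urllib.parse import urlsplit

ALLOWED_TLDS = {'et', 'er', 'so', 'dj', 'com', 'org', 'net', 'gov', 'info', 'edu'}

def domain_allowed(url):
    """Accept a URL only if its host name ends with an allowed top level domain."""
    hostname = urlsplit(url).hostname or ''
    return hostname.rsplit('.', 1)[-1].lower() in ALLOWED_TLDS

# Example: domain_allowed('http://www.ethsat.com/news') -> True
}}}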

We took the same approach: previous crawls (in the case of Norwegian and Czech) as well as URLs suggested by the search engine Bing were used to initialise the crawler. For the African languages, the process was as follows (a code sketch of the filtering steps is given below the list):
* Bigrams of words from An Crúbadán were sorted by frequency.
* Items from rank 200 to rank 1100 were filtered for whole words (e.g. punctuation and line breaks were removed).
* The list was manually filtered by native speakers.
* The clean lists (300 to 500 pairs of words) were used to query the search engine.
* The resulting URLs were employed as starting points for the crawler.
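
The sketch below illustrates the frequency-rank filtering of the bigram list. The input format (one `word1 word2<TAB>frequency` record per line) and the file name are assumptions made for the example; the query helper at the end is hypothetical.

{{{#!python
import re

WORD_PAIR = re.compile(r'^\w+ \w+$', re.UNICODE)

def load_bigrams(path):
    """Read 'word1 word2<TAB>frequency' lines, sorted by descending frequency."""
    bigrams = []
    with open(path, encoding='utf-8') as source:
        for line in source:
            try:
                pair, frequency = line.rstrip('\n').rsplit('\t', 1)
                bigrams.append((pair.strip(), int(frequency)))
            except ValueError:
                continue  # skip malformed lines
    bigrams.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in bigrams]

def candidate_pairs(bigrams, first_rank=200, last_rank=1100):
    """Keep only whole word pairs from the given frequency rank range."""
    return [pair for pair in bigrams[first_rank:last_rank] if WORD_PAIR.match(pair)]

# The pairs that survive manual filtering by native speakers are then sent
# to the search engine as phrase queries, e.g. (hypothetical helper):
# for pair in candidate_pairs(load_bigrams('tigrinya_bigrams.txt')):
#     seed_urls.extend(query_search_engine('"%s"' % pair))
}}}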

|| Language || Filtered word bigrams || Seed URLs found by the search engine || Most frequent web domains in seed URLs yielding text ||
|| Amharic || 354 || 6,905 || am.wikipedia.org, www.zehabesha.com, andadirgen.blogspot.com, plus.google.com, wol.jw.org, www.dejeselam.org, ethsat.com ||
|| Oromo || 366 || 9,843 || www.voaafaanoromoo.com, www.youtube.com, finfinnetribune.com, gadaa.com, www.bilisummaa.com, www.oromoliberationfront.org, qeerroo.org ||
|| Somali || 432 || 18,087 || www.youtube.com, salaanmedia.com, waajid.wordpress.com, so.wikipedia.org, www.ogadennet.com, www.bbc.com, geeska.net ||
|| Tigrinya || 424 || 9,007 || assenna.com, www.youtube.com, tigrigna.voanews.com, www.jeberti.com, www.betezion.com, demhitonline.blogspot.com, www.gereger.com ||

Boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language and make corpus searches provide no useful evidence about the phenomenon being investigated.

= HaBiT Corpus Creation Pipeline =
The data retrieval software and text cleaning tools suitable for use within the corpus creation pipeline in this project are:
1. WebBootCaT, a search engine querying and corpus building tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine and downloads the documents.
1. !SpiderLing, a web crawler for linguistic purposes. It crawls text-rich parts of the web and collects large amounts of data suitable for text corpora.
1. Chared, a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it is more accurate than character encoding detection algorithms with no language constraints.
1. Language identification using a wordlist and a character trigram model.
1. Justext, a tool for removing boilerplate content, such as navigation links, headers and footers, from HTML pages (a usage example is given below). It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate the content from the boilerplate.
1. Onion, a tool for removing duplicate parts from large collections of texts. One instance of each duplicate part is kept; the others are marked or removed. The tool removes both identical and similar documents, on any level (document, paragraph or sentence, if present in the data). The de-duplication algorithm is based on comparing hashes of n-grams of words.
1. Unitok, a universal text tokeniser with specific settings for many languages. It turns plain text into a sequence of newline-separated tokens (vertical format), while preserving XML-like tags containing metadata.

Notes: The models used in the process were trained on subsets of documents found by the search engine in the target languages. Malformed data (e.g. misspelled words) is not dealt with.
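
As an illustration of the boilerplate removal step, the sketch below applies the Justext Python package to a single downloaded page. The URL is a placeholder and the English stop word list is only an example; for the target languages an appropriate stop word list (or Justext's language-independent mode) would be needed.

{{{#!python
import urllib.request
import justext

# Download one page and keep only the paragraphs Justext does not mark as boilerplate.
html = urllib.request.urlopen('https://am.wikipedia.org/').read()
paragraphs = justext.justext(html, justext.get_stoplist('English'))
clean_text = '\n'.join(p.text for p in paragraphs if not p.is_boilerplate)
print(clean_text)
}}}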

= Obtaining Low Resourced Language Resources =
Since the African languages dealt with in the project are low resourced and not much represented on the web, the corpus creation pipeline has to be adapted accordingly.

The crawler will be modified to improve the yield of the process for low resourced languages:
* The crawler will be constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in the seed URLs – will be allowed to improve the recall.
* The traversing strategy will allow continuous downloading from web domains yielding only a small amount of text.
* The scheduler will not limit the number of documents downloaded from a single web domain.
* The crawler will gather documents from all target national domains in a single run. The documents will be split into paragraphs and every paragraph will be classified according to its language (as sketched below). This way, documents containing paragraphs in various languages are used in the optimal way.
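
The sketch below shows one simple way such paragraph-level language separation could be implemented, using wordlist coverage scoring. The wordlist files, the threshold and the helper names are assumptions made for the example; the actual pipeline relies on the wordlist and character trigram models mentioned earlier.

{{{#!python
def load_wordlists(paths_by_language):
    """paths_by_language: e.g. {'amharic': 'am_words.txt', 'oromo': 'om_words.txt'}."""
    wordlists = {}
    for language, path in paths_by_language.items():
        with open(path, encoding='utf-8') as source:
            wordlists[language] = set(word.strip().lower() for word in source)
    return wordlists

def guess_language(paragraph, wordlists, threshold=0.3):
    """Return the language whose wordlist covers the paragraph best, or None."""
    words = paragraph.lower().split()
    if not words:
        return None
    best_language, best_score = None, 0.0
    for language, wordlist in wordlists.items():
        score = sum(1 for word in words if word in wordlist) / len(words)
        if score > best_score:
            best_language, best_score = language, score
    return best_language if best_score >= threshold else None

def split_document_by_language(document, wordlists):
    """Route each paragraph of a crawled document to its detected language."""
    routed = {}
    for paragraph in document.split('\n\n'):
        language = guess_language(paragraph, wordlists)
        if language is not None:
            routed.setdefault(language, []).append(paragraph)
    return routed
}}}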

The models used in the process can be trained on texts from Wikipedia or on subsets of documents found by the search engine for all six target languages. Two factors make discerning the target African languages harder and thus make the corpus creation pipeline less accurate:
1. Some languages are spoken in several countries (i.e. across several top level domains).
1. Multiple languages are spoken within a single country.
To mitigate the risk of determining the language of a text incorrectly, it is important to carefully check the data used for building the language dependent models.

Web domains rich in text documents are worth analysing with respect to the structure of their content, since that might increase the amount of harvested data. For example, a web site created using a content management system might offer a site map containing the URLs of all documents within the site, or there can be a sequence of numbers assigned to all documents on the site. In such cases, one can develop a script tailored for downloading from the particular web site, reaching a higher efficiency than the level achieved by a general web crawler. That would lead to larger corpora for languages with a scarce presence on the internet.
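
For the site map case, a tailored download script can be as simple as the sketch below, which reads a standard sitemap.xml and returns all listed document URLs. The sitemap URL is a placeholder; real sites may use sitemap index files or other conventions.

{{{#!python
import urllib.request
import xml.etree.ElementTree as ElementTree

SITEMAP_NAMESPACE = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def urls_from_sitemap(sitemap_url):
    """Return all <loc> URLs listed in a standard sitemap.xml."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        tree = ElementTree.parse(response)
    return [element.text.strip()
            for element in tree.iter(SITEMAP_NAMESPACE + 'loc')
            if element.text]

# Example (placeholder URL):
# for url in urls_from_sitemap('http://www.example.et/sitemap.xml'):
#     download(url)
}}}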