= D1.1.2 Specification of corpora and the corpus building module =

= Web corpus creation process =
The internet is used by computational linguists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.
[[BR]]To build a large collection of texts from the web, one needs to master the following general steps:
 * Identify suitable documents to obtain.
 * Download the selected data from the internet, keeping important metadata such as the source URL and the date of acquisition.
 * Process the obtained data: strip off non-textual parts, clear away boilerplate and other unwanted text, remove duplicate parts, and apply further cleaning methods to obtain quality data.
 * Store the result in a way that enables access according to the desired purpose, reuse of the data, and reprocessing of the raw downloaded data.

== Obtaining Text Data from the Web ==
There are billions of documents available on the web. The process of traversing the web and downloading data (crawling) is a time- and resource-consuming task. A web crawler is a piece of software made for the task of crawling the internet. The crawler is usually initialised with a set of starting points, the seed URLs. It downloads each document from the initial set, extracts links to other documents from the data and continues its work with the newly discovered set of URLs.
[[BR]]In a one-time web crawl, a URL that is discovered again is considered a duplicate and discarded. Text data is collected only once in this project; repeated crawls keeping multiple versions of the same web pages are not required. The result is a 'snapshot' of a part of the web in the target languages.
[[BR]]The crawling strategy – deciding which parts of the web to explore first, i.e. which documents to download immediately and which to postpone – is a very important factor in the design of a successful crawler. The traversal algorithm is crucial for achieving wide coverage of web domains, high crawling efficiency, a large amount of extracted text or the capture of 'important' web pages, whichever is the priority. !SpiderLing, the crawler enhanced and used within this project, focuses on downloading documents from web domains yielding a lot of text in the target languages.
[[BR]]Additional issues have to be taken into account when crawling the web: not overloading the source servers and obeying the robots exclusion protocol, boilerplate removal and content de-duplication (if desired), and robust postprocessing of the crawled data (e.g. dealing with malformed data, language detection, character encoding detection).
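The crawl loop described above can be illustrated by the following minimal, single-threaded sketch: it starts from a set of seed URLs, extracts links, discards already seen URLs and keeps track of how much text each web domain yields. It is only a simplified illustration of the general idea, not the !SpiderLing implementation; the seed list, the allowed top level domains and the document limit are placeholder values chosen for this example.

{{{#!python
"""Minimal sketch of a crawl loop with seed URLs, URL de-duplication and a
per-domain text yield counter. A simplified illustration only, not SpiderLing."""

import collections
import re
import urllib.parse
import urllib.request

SEED_URLS = ["https://example.org/"]   # hypothetical seed list
ALLOWED_TLDS = {"org", "net", "com"}   # illustrative crawl scope
MAX_DOCS = 100                         # stop condition for the sketch

def extract_links(base_url, html):
    """Very naive link extraction; a real crawler parses HTML properly."""
    hrefs = re.findall(r'href="([^"#]+)"', html)
    return [urllib.parse.urljoin(base_url, h) for h in hrefs]

def crawl(seeds):
    frontier = collections.deque(seeds)    # URLs waiting to be downloaded
    seen = set(seeds)                      # URL de-duplication
    domain_yield = collections.Counter()   # characters of text per domain
    downloaded = 0
    while frontier and downloaded < MAX_DOCS:
        url = frontier.popleft()
        host = urllib.parse.urlsplit(url).hostname or ""
        if host.rsplit(".", 1)[-1] not in ALLOWED_TLDS:
            continue                       # outside the crawl scope
        # A production crawler must also respect robots.txt and rate limits (not shown).
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                       # network or URL errors are simply skipped here
        downloaded += 1
        # Strip tags to get a rough estimate of how much text the domain yields.
        text = re.sub(r"<[^>]+>", " ", html)
        domain_yield[host] += len(text)
        for link in extract_links(url, html):
            if link not in seen:           # a re-discovered URL is discarded
                seen.add(link)
                frontier.append(link)
    return domain_yield

if __name__ == "__main__":
    print(crawl(SEED_URLS))
}}}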
== Target languages ==
The target languages of this project are Amharic, Czech, Norwegian, Oromo, Somali and Tigrinya:

|| Language || Spoken in countries || Top level domains || Notes important for the process ||
|| Amharic || Ethiopia || et || Ge’ez script ||
|| Czech || Czech Republic || cz || Similar to Slovak ||
|| Norwegian || Norway || no || Two written forms, similar to Danish ||
|| Oromo || Ethiopia, Somalia || et, so || ||
|| Somali || Djibouti, Somalia || dj, so || ||
|| Tigrinya || Eritrea, Ethiopia || et, er || Ge’ez script ||

[[BR]]Unlike Norwegian and Czech, the African languages in this set can be considered low resourced – there are no large corpora available for them. Also, the amount of text in Amharic, Oromo and Tigrinya present on the web is quite limited.
[[BR]]In the case of the African languages, the crawler was constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in the seed URLs – were also allowed to improve the recall.

== Obtaining low resourced language resources ==
The crawler was modified to improve the yield of the process for low resourced languages:
 * The traversal strategy allows continued downloading from web domains yielding only a small amount of text.
 * The scheduler does not limit the number of documents downloaded from a single web domain.
 * The crawler gathered documents from all target national domains in a single run. The documents were split into paragraphs and each paragraph was classified by language, so that documents containing paragraphs in various languages were used in the optimal way.

== Obtaining the seed URLs ==
Starting the crawl with good, text-yielding and trustworthy sources benefits the resulting corpus. Exploiting web search engines is a way to identify relevant documents on the web. The search engine is expected to supply good text data in the desired language based on the search parameters.
[[BR]]WebBootCaT is a tool for bootstrapping corpora and terms from the web (an extension of a method devised by Baroni, nowadays a module in the !SketchEngine corpus query system). It allows quick and effortless focused web corpus building. A similar approach on a much larger scale was used later by the !ClueWeb project, which started with two types of seed URLs: one set coming from an earlier 200 million page crawl, the other provided by commercial search engines (Google, Yahoo). The search engines were queried using the most frequent queries and random word queries for each target language.
[[BR]]We took the same approach: previous crawls (in the case of Norwegian and Czech) as well as URLs suggested by the Bing search engine were used to initialise the crawler. For the African languages, the process was as follows:
 * Bigrams of words from An Crúbadán were sorted by frequency.
 * Items from rank 200 to rank 1100 were filtered to whole words (e.g. punctuation and line breaks were removed).
 * The lists were manually filtered by native speakers. (Thanks to Derib Ado and Feda Negesse from Addis Ababa University.)
 * The clean lists (300 to 500 pairs of words) were used to query the search engine.
 * The resulting URLs were employed as starting points for the crawler.

== Using language dependent models for web text cleaning ==
The world wide web has become the source of data preferred by many for NLP oriented research. However, the content of the web is not regulated in terms of data quality, originality or correct description.
[[BR]]In the first phase, the document language and character encoding have to be identified. In the case of a single target language, data in other languages are stripped off. The document encoding can be normalised to UTF-8, the most widespread encoding standard capable of representing all necessary character codepoints.
[[BR]]Boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language, and corpus searches provide no useful evidence about the phenomenon being investigated.
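To give a concrete idea of how such cleaning can work, the following sketch classifies text blocks by their link density and stopword density. It is a deliberately simplified heuristic in the spirit of boilerplate removal tools such as jusText, not the actual algorithm used in the pipeline; the thresholds and the tiny stopword list are illustrative assumptions only.

{{{#!python
"""Simplified boilerplate classification of text blocks. Thresholds and the
stopword list are illustrative, not the values used in the actual pipeline."""

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it"}  # tiny English sample
MAX_LINK_DENSITY = 0.3      # assumed threshold: share of words inside hyperlinks
MIN_STOPWORD_DENSITY = 0.2  # assumed threshold: share of function words
MIN_WORDS = 10              # very short blocks are suspicious as well

def classify_block(words, linked_words):
    """Return 'good' for blocks that look like running text, 'boilerplate' otherwise.

    words        -- list of tokens in the block
    linked_words -- number of those tokens that appear inside hyperlinks
    """
    if len(words) < MIN_WORDS:
        return "boilerplate"              # headers, menu items, footers ...
    link_density = linked_words / len(words)
    stopword_density = sum(w.lower() in STOPWORDS for w in words) / len(words)
    if link_density > MAX_LINK_DENSITY:
        return "boilerplate"              # lists of links, navigation
    if stopword_density < MIN_STOPWORD_DENSITY:
        return "boilerplate"              # enumerations, decorative elements
    return "good"

# Example: a navigation menu vs. a sentence of running text.
print(classify_block("Home News Sport Contact About".split(), linked_words=5))
print(classify_block("The crawler downloads each of the documents and "
                     "extracts links to other pages in the web .".split(),
                     linked_words=0))
}}}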
[[BR]]Another important issue is duplicate content. Digital information on the web is easily copied, and thus documents may overlap significantly. This happens, for instance, when multiple online newspapers publish the same announcement released by a press agency, when documents are revised, or when previous posts are quoted in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential to avoid e.g. duplicate concordance lines and to prevent biased results of statistical processing caused by artificially inflated frequencies of some words and expressions. We remove paragraphs in which more than 50 % of the word 7-tuples have been encountered in previously processed data.

== HaBiT corpus creation pipeline ==
To summarise, the data retrieval software and text cleaning tools used within the corpus creation pipeline in this project are:
 * WebBootCaT, a search engine querying and corpus building tool.
 * !SpiderLing, a web crawler for linguistic purposes.
 * Character encoding detection using a byte trigram model.
 * Language identification using a wordlist and a character trigram model.
 * jusText, a boilerplate removal tool using a heuristic algorithm.
 * Onion, a de-duplication tool removing both identical and similar content at the paragraph level.

Notes: The models used in the process were trained on subsets of documents found by the search engine in the target languages. Malformed data (e.g. wrong spelling of words) is not dealt with.
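The de-duplication criterion described above – discarding paragraphs in which more than 50 % of the word 7-tuples have already been seen – can be illustrated by the following minimal sketch. It only demonstrates the principle, not the Onion tool itself; a plain in-memory set of 7-tuples as used here would not scale to web-sized data.

{{{#!python
"""Minimal illustration of paragraph-level near-duplicate removal based on
word 7-tuples. The actual pipeline uses the Onion tool, which stores the
7-tuples far more efficiently than a plain Python set."""

NGRAM = 7            # length of the word tuples
THRESHOLD = 0.5      # drop a paragraph if more than 50 % of its 7-tuples were seen

def shingles(words, n=NGRAM):
    """All contiguous word n-tuples of the paragraph."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def deduplicate(paragraphs):
    """Yield only paragraphs that are not near-duplicates of earlier ones."""
    seen = set()                          # 7-tuples from previously kept data
    for paragraph in paragraphs:
        tuples = shingles(paragraph.split())
        if tuples:
            duplicate_ratio = sum(t in seen for t in tuples) / len(tuples)
            if duplicate_ratio > THRESHOLD:
                continue                  # near-duplicate paragraph, discarded
        seen.update(tuples)
        yield paragraph

if __name__ == "__main__":
    docs = [
        "a b c d e f g h i j",            # kept, nothing seen so far
        "a b c d e f g h i j k",          # mostly repeated 7-tuples, discarded
        "k l m n o p q r s t u v",        # kept, new content
    ]
    for kept in deduplicate(docs):
        print(kept)
}}}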