
D1.1.2 Specification of corpora and the corpus building module

Web Corpus Creation Process

The internet is used by computational linguists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.

To build a large collection of texts from the web, one needs to master the following general steps:

  • Identify suitable documents to obtain. In the case of this project, all documents containing text in target languages are suitable.
  • Download the selected data from the internet, keeping important metadata such as the source URL and the date of acquisition.
  • Process the obtained data by stripping off non-textual parts, clearing away boilerplate and unwanted text parts, removing duplicate parts, and applying other methods to obtain quality data in the result.
  • Store the result in a way that enables access according to the desired purpose, reusability, and re-processing of the raw internet data. The usual way to store and access a corpus is to employ a corpus manager.

Target Languages

The target languages of this project are Amharic, Czech, Norwegian, Oromo, Somali and Tigrinya:

Language  | Spoken in countries | Top level domains | Notes important for the process
Amharic   | Ethiopia            | et                | Ge’ez script
Czech     | Czech Republic      | cz                | Similar to Slovak
Norwegian | Norway              | no                | Two written forms, similar to Danish
Oromo     | Ethiopia, Somalia   | et, so            |
Somali    | Djibouti, Somalia   | dj, so            |
Tigrinya  | Eritrea, Ethiopia   | et, er            | Ge’ez script

Unlike Norwegian and Czech, the African languages in this set can be considered low-resourced – there are no large corpora available for them. Moreover, the presence of texts in Amharic, Oromo and Tigrinya on the web is quite limited.

Obtaining Text Data from the Web

There are billions of documents available on the web. The process of traversing the web and downloading data (crawling) is a time and resource consuming task. A web crawler is a piece of software made for the task of crawling the internet. The crawler is usually initialized by a set of starting internet points, the seed URLs. It downloads each document from the initial set, extracts links to other documents from the data and continues its work with the discovered set of new URLs.
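
A minimal sketch of such a crawl loop in Python, using the requests library and a naive link extractor (illustrative only, not the actual crawler used in this project):

```python
# Minimal crawl loop: download documents, keep basic metadata,
# extract links and schedule previously unseen URLs.
from collections import deque
from urllib.parse import urljoin
import re

import requests  # assumed available; any HTTP client would do

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(seed_urls, max_docs=1000):
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    documents = []
    while frontier and len(documents) < max_docs:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        # keep the raw data together with the source URL and date of acquisition
        documents.append({"url": url, "html": response.text,
                          "date": response.headers.get("Date")})
        # extract links and continue with the discovered set of new URLs
        for href in LINK_RE.findall(response.text):
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return documents
```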

During a one-time web crawl, any URL that is discovered again is considered a duplicate and discarded. Text data is collected once in this project; repeated crawls to keep multiple versions of the same web pages are not required. The result is a `snapshot' of a part of the web in the target languages.

The crawling strategy – deciding which parts of the web to explore first, i.e. which documents to download immediately and which to postpone for later – is a very important factor in the design of a successful crawler. The traversal algorithm is crucial for achieving wide coverage of web domains, high crawling efficiency, a high amount of extracted data, or catching `important' web pages, whichever is the priority. SpiderLing, the crawler enhanced and used within this project, focuses on downloading documents from web domains that yield a lot of text in the target languages.
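
The following sketch illustrates the yield-rate idea behind such a strategy (illustrative thresholds, not SpiderLing's actual scheduler): a domain keeps being crawled only while the ratio of clean text obtained to data downloaded stays above a minimum.

```python
# Hedged sketch of yield-rate based scheduling: prefer domains where the
# ratio of clean text to downloaded bytes is high, stop visiting domains
# whose yield rate drops below a threshold.
from collections import defaultdict

class DomainScheduler:
    def __init__(self, min_yield_rate=0.01, grace_bytes=1_000_000):
        self.downloaded = defaultdict(int)   # bytes fetched per domain
        self.text_bytes = defaultdict(int)   # bytes of clean text kept per domain
        self.min_yield_rate = min_yield_rate
        self.grace_bytes = grace_bytes       # do not judge a domain too early

    def record(self, domain, fetched_bytes, clean_text_bytes):
        self.downloaded[domain] += fetched_bytes
        self.text_bytes[domain] += clean_text_bytes

    def yield_rate(self, domain):
        fetched = self.downloaded[domain]
        return self.text_bytes[domain] / fetched if fetched else 1.0

    def should_crawl(self, domain):
        # keep crawling new or small domains; drop domains that proved unproductive
        if self.downloaded[domain] < self.grace_bytes:
            return True
        return self.yield_rate(domain) >= self.min_yield_rate
```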

Additional issues have to be taken into account when crawling the web: not overloading the source servers and obeying the Robots exclusion protocol, boilerplate removal and content de-duplication (if desired), and robust post-processing of crawled data (e.g. language detection, character encoding detection, dealing with malformed data if needed).
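
For instance, the Robots exclusion protocol can be checked with Python's standard urllib.robotparser before fetching a URL (the user agent name below is illustrative):

```python
# Check robots.txt of the target site before downloading a URL.
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="HaBiTCrawler"):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    try:
        parser.read()          # downloads and parses robots.txt
    except OSError:
        return True            # no robots.txt reachable: assume allowed
    return parser.can_fetch(user_agent, url)
```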

Obtaining Seed URLs

Starting the crawl from good, text-yielding and trustworthy sources benefits the resulting corpus. Exploiting web search engines is one way to identify relevant documents on the web: the search engine is expected to return good text data in the desired language based on the search parameters.

WebBootCaT is a tool for bootstrapping corpora and terms from the web (an extension of a method devised by Baroni, embedded as a module in the Sketch Engine corpus query system). It allows quick and effortless focused web corpus building. A similar approach on a much larger scale was used later by the ClueWeb project. It started with two types of seed URLs: one from an earlier 200 million page crawl, the other provided by commercial search engines. The search engines were queried using the most frequent queries and random word queries for each target language.

We took the same approach: previous crawls (in the case of Norwegian and Czech) as well as URLs suggested by the Bing search engine were used to initialise the crawler. For the African languages, the process was as follows:

  • Bigrams of words from An Crúbadán were sorted by frequency.
  • Items from rank 200 to rank 1100 were filtered to keep only whole words (e.g. punctuation and line breaks were removed) – see the sketch after this list.
  • The list was manually filtered by native speakers.
  • The clean lists (300 to 500 pairs of words) were used to query the search engine.
  • The resulting URLs were employed as starting points for the crawler.
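
A sketch of the bigram selection steps above, assuming a simple "word1 word2<TAB>frequency" list format (the actual An Crúbadán files may differ):

```python
# Select query bigrams: sort by frequency, take ranks 200-1100,
# keep only pairs consisting of whole words.
import re

WORD_RE = re.compile(r"^\w+$", re.UNICODE)

def select_query_bigrams(path, lo=200, hi=1100):
    bigrams = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair, _, freq = line.rstrip("\n").rpartition("\t")
            if pair:
                bigrams.append((int(freq), pair))
    bigrams.sort(reverse=True)                      # most frequent first
    candidates = [pair for _, pair in bigrams[lo:hi]]
    # drop items containing punctuation or other non-word material
    return [p for p in candidates
            if all(WORD_RE.match(w) for w in p.split())]

# The resulting list is then reviewed by native speakers and the surviving
# 300-500 word pairs are sent as queries to the search engine.
```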

Language | Filtered word bigrams | Seed URLs found by the search engine | Most frequent text-yielding web domains in seed URLs
Amharic  | 354 | 6,905  | am.wikipedia.org, www.zehabesha.com, andadirgen.blogspot.com, plus.google.com, wol.jw.org, www.dejeselam.org, ethsat.com
Oromo    | 366 | 9,843  | www.voaafaanoromoo.com, www.youtube.com, finfinnetribune.com, gadaa.com, www.bilisummaa.com, www.oromoliberationfront.org, qeerroo.org
Somali   | 432 | 18,087 | www.youtube.com, salaanmedia.com, waajid.wordpress.com, so.wikipedia.org, www.ogadennet.com, www.bbc.com, geeska.net
Tigrinya | 424 | 9,007  | assenna.com, www.youtube.com, tigrigna.voanews.com, www.jeberti.com, www.betezion.com, demhitonline.blogspot.com, www.gereger.com

Using Language Dependent Models for Web Text Cleaning

We observe that the world wide web has become the preferred source of data for much NLP-oriented research. However, the content of the web is not regulated in terms of data quality, originality, or correct description.

In the first phase, the document language and character encoding have to be identified. In the case of a single target language, data in other languages is stripped off. The document encoding can be normalised to UTF-8, the most widespread encoding standard, which is capable of representing all necessary character codepoints.
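
A minimal sketch of the normalisation step; detect_encoding() is a placeholder for a detector such as chared, which exploits a model of the known document language:

```python
# Re-encode a raw document as UTF-8 once its original encoding is known.
def to_utf8(raw_bytes, detect_encoding):
    encoding = detect_encoding(raw_bytes)        # e.g. "windows-1252", "utf-8"
    try:
        text = raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # fall back to a lossy decode rather than dropping the document
        text = raw_bytes.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```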

Boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements, and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language and make corpus searches provide no useful evidence about the phenomenon being investigated. Therefore it is necessary to remove boilerplate from the corpus data.
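
A short example of paragraph-level boilerplate removal with the jusText tool listed in the pipeline below (the stop word list name is illustrative; for languages without a bundled stoplist a custom list has to be supplied):

```python
# Keep only paragraphs that jusText does not classify as boilerplate.
import justext

def clean_paragraphs(html_bytes, language="Czech"):
    paragraphs = justext.justext(html_bytes, justext.get_stoplist(language))
    return [p.text for p in paragraphs if not p.is_boilerplate]
```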

Another important issue is duplication. Digital information on the web is easily copied and thus documents may have significant overlaps. That happens, for example, when multiple online newspapers publish the same announcement released by a press agency, with document revisions, or with quotations of previous posts in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential to avoid e.g. duplicate concordance lines and to prevent biased results derived from statistical processing of corpus data with artificially inflated frequencies of some words and expressions. We remove paragraphs in which more than 50 % of the word 7-tuples have already been encountered in previously processed data.
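
A sketch of this 50 % rule on word 7-tuples (onion, listed below, implements it efficiently with n-gram hashes; this is only an illustration):

```python
# Drop a paragraph when more than half of its word 7-tuples
# were already seen in previously processed data.
seen_ngrams = set()

def is_duplicate(paragraph, n=7, threshold=0.5):
    words = paragraph.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    hashes = [hash(g) for g in ngrams]
    repeated = sum(1 for h in hashes if h in seen_ngrams)
    seen_ngrams.update(hashes)
    return repeated / len(ngrams) > threshold
```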

Filtering out paragraphs in a language similar (but not identical) to the desired language of a document might be necessary in the case of Norwegian (similar to Danish) and Czech (similar to Slovak). It can be achieved by counting words common in the target language but not present in the similar language (accepted words) and vice versa (rejected words). Only paragraphs containing many more accepted words than rejected words should be kept.
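
A sketch of this accepted/rejected word filter (the word lists and the ratio threshold are illustrative):

```python
# Keep a paragraph only if accepted words clearly outnumber rejected words.
def keep_paragraph(paragraph, accepted_words, rejected_words, ratio=3.0):
    words = [w.lower() for w in paragraph.split()]
    accepted = sum(1 for w in words if w in accepted_words)
    rejected = sum(1 for w in words if w in rejected_words)
    return accepted > ratio * max(rejected, 1)
```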

Misspelled words and other kinds of malformed data are not dealt with in this project.

HaBiT Corpus Creation Pipeline

The current data retrieval software and text cleaning tools suitable for use within the corpus creation pipeline in this project are:

  1. WebBootCaT, a search engine querying and corpus building tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine and downloads the documents.
  2. SpiderLing is a web spider for linguistics. It can crawl text-rich parts of the web and collect large amounts of data suitable for text corpora.
  3. Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that a corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
  4. Justext is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate content from boilerplate.
  5. Onion is a tool for removing duplicate parts from large collections of texts. One instance of each duplicate part is kept; the others are marked or removed. The tool allows removing both identical and similar parts, on any level (document, paragraph, sentence – if present in the data). The de-duplication algorithm is based on comparing hashes of word n-grams.
  6. Unitok is a universal text tokeniser with specific settings for many languages. It turns plain text into a sequence of newline-separated tokens (vertical format), while preserving XML-like tags containing metadata – see the illustration after this list.
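
To illustrate the vertical format produced by a tokeniser such as Unitok, here is a very simplified tokeniser (not Unitok itself):

```python
# One token per line, XML-like structural tags preserved as single lines.
import re

TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]", re.UNICODE)

def to_vertical(text):
    return "\n".join(TOKEN_RE.findall(text))

print(to_vertical('<doc url="http://example.org"> Dogs bark, cats meow. </doc>'))
# <doc url="http://example.org">
# Dogs
# bark
# ,
# ...
```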

Obtaining Low Resourced Language Resources

Since the African languages dealt with in the project are low-resourced and not much represented on the web, the corpus creation pipeline has to be adapted accordingly.

The crawler will be modified to improve the yield of the process for low-resourced languages:

  • The crawler will be constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. Generic domains com, org, net, gov, info and edu – frequent in the seed URLs – will also be allowed to improve recall (see the sketch after this list).
  • The traversal strategy will allow continuous downloading from web domains yielding only a small amount of text.
  • The scheduler will not limit the number of documents downloaded from a single web domain.
  • The crawler will gather documents from all target national domains in a single run. The documents will be split into paragraphs and each paragraph classified according to its language. This way, documents containing paragraphs in various languages can be used in the optimal way.
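
A sketch of the top level domain constraint (a simple suffix check; the allowed set follows the list above):

```python
# Schedule only URLs whose host ends in one of the allowed top level domains.
from urllib.parse import urlparse

ALLOWED_TLDS = {"et", "er", "so", "dj",
                "com", "org", "net", "gov", "info", "edu"}

def tld_allowed(url):
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in ALLOWED_TLDS
```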

The models used in the process will be trained on texts from Wikipedia or on subsets of documents found by the search engine for all six target languages. Two factors make discerning the target African languages harder and thus the corpus creation pipeline less accurate:

  1. Some languages are spoken in several countries (top level domains).
  2. There are multiple languages spoken in a country.

To mitigate the risk of determining the language of a text incorrectly, it is important to carefully check data used for building language dependent models.

Web domains rich in text documents are worth analysing with respect to the structure of their content, as that might increase the amount of harvested data. For example, a web site created with a content management system might offer a site map containing the URLs of all documents within the site, or there can be a sequence of numbers assigned to all documents on the site. In such cases, one can develop a script tailored for downloading from the particular web site, reaching a higher efficiency than a general web crawler. An analysis will be carried out after the web crawl to identify web domains allowing such a semi-automated approach to obtaining data. That will lead to larger corpora for languages with a scarce presence on the internet.
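
A sketch of harvesting document URLs from a site map, a common convention of content management systems (the sitemap URL in the usage comment is illustrative):

```python
# Collect all <loc> entries from a sitemap.xml file.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url):
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

# urls = urls_from_sitemap("http://www.example.org/sitemap.xml")
```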

Resources Available at the Beginning of the Project

  • Corpora built by the Text Laboratory, University of Oslo
  • Word lists at crubadan.org (An Crúbadán)
  • Wikipedia articles
