To build a large collection of texts from the web, one needs to master the following general steps:
* Identify suitable documents to obtain. In the case of this project, all documents containing text in the target languages are suitable.
* Download the selected data from the internet, keeping important metadata such as the source URL and the date of acquisition.
* Process the obtained data by stripping off non-textual parts, clearing away boilerplate and other unwanted text, removing duplicate parts, and applying other methods that improve the quality of the resulting data.
* Store the result in a way that enables access according to the desired purpose, reusability and the ability to process the raw internet data again. The usual way to store and access a corpus is to employ a corpus manager.

= Obtaining Text Data from the Web =
There are billions of documents available on the web. The process of traversing the web and downloading data (crawling) is a time and resource consuming task. A web crawler is a piece of software made for the task of crawling the internet. The crawler is usually initialised by a set of starting internet points, the seed URLs. It downloads each document from the initial set, extracts links to other documents from the data and continues its work with the discovered set of new URLs.
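
The following sketch illustrates the basic crawl loop described above: a frontier seeded with start URLs, from which documents are downloaded and new links extracted. It is a simplified, single-threaded illustration, not the algorithm actually used by !SpiderLing; a production crawler adds scheduling, politeness delays, parallel downloading and robust error handling.

{{{#!python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(seed_urls, max_docs=100):
    """Breadth-first crawl: download, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    documents = []
    while frontier and len(documents) < max_docs:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
        except OSError:
            continue  # skip unreachable documents
        documents.append((url, html))
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            # keep only HTTP(S) links that have not been seen before
            if absolute.startswith(('http://', 'https://')) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return documents
}}}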

Since this is a one-time web crawl, a URL that is discovered again is considered a duplicate and discarded. Text data is collected once in this project; repeated crawls to keep multiple versions of the same web pages are not required. A "snapshot" of a part of the web in the target languages is created.

The crawling strategy – deciding which parts of the web to explore first, i.e. which documents to download immediately and which to postpone for later – is a very important factor in the design of a successful crawler. The traversing algorithm implemented is crucial for achieving wide coverage of web domains, higher crawling efficiency, a larger amount of extracted text or the capture of "important" web pages, whichever is the priority. !SpiderLing, the crawler enhanced and used within this project, focuses on downloading documents from web domains yielding a lot of text in the target languages.

Additional issues have to be taken into account when crawling the web: not overloading the source servers (by obeying the Robots exclusion protocol), boilerplate removal and content de-duplication (if desired), and robust post-processing of the crawled data (e.g. dealing with malformed data, language detection, character encoding detection).
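
Obeying the Robots exclusion protocol, mentioned above, can be handled with the Python standard library as in the sketch below. The user agent string is a placeholder, not the identifier actually used by the project's crawler.

{{{#!python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent='HaBiTCrawler'):
    """Check whether the site's robots.txt permits fetching the URL."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, '/robots.txt', '', ''))
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        return True  # robots.txt not reachable: assume crawling is allowed
    return parser.can_fetch(user_agent, url)

# Example: skip a document if the site forbids crawling it.
# if allowed_by_robots('http://am.wikipedia.org/wiki/some_page'):
#     download(...)
}}}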

= Target Languages =
In the case of the African languages, the crawler was constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in the seed URLs – were allowed to improve the recall.
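
A minimal sketch of such a domain constraint is given below. The set of allowed top level domains mirrors the list above; the helper name is illustrative.

{{{#!python
from urllib.parse import urlsplit

ALLOWED_TLDS = {'et', 'er', 'so', 'dj', 'com', 'org', 'net', 'gov', 'info', 'edu'}

def domain_allowed(url):
    """Accept a URL only if its host name ends with an allowed top level domain."""
    hostname = urlsplit(url).hostname or ''
    return hostname.rsplit('.', 1)[-1].lower() in ALLOWED_TLDS

# Example: domain_allowed('http://www.ethsat.com/news') -> True
}}}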

We took the same approach: previous crawls (in the case of Norwegian and Czech) as well as URLs suggested by the search engine Bing were used to initialise the crawler. For the African languages, the process was as follows (a code sketch of the filtering steps is given below the list):
* Bigrams of words from An Crúbadán were sorted by frequency.
* Items from rank 200 to rank 1100 were filtered for whole words (e.g. punctuation and line breaks were removed).
* The list was manually filtered by native speakers.
* The clean lists (300 to 500 pairs of words) were used to query the search engine.
* The resulting URLs were employed as starting points for the crawler.
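
The sketch below illustrates the frequency-rank filtering of the bigram list. The input format (one `word1 word2<TAB>frequency` record per line) and the file name are assumptions made for the example; the query helper at the end is hypothetical.

{{{#!python
import re

WORD_PAIR = re.compile(r'^\w+ \w+$', re.UNICODE)

def load_bigrams(path):
    """Read 'word1 word2<TAB>frequency' lines, sorted by descending frequency."""
    bigrams = []
    with open(path, encoding='utf-8') as source:
        for line in source:
            try:
                pair, frequency = line.rstrip('\n').rsplit('\t', 1)
                bigrams.append((pair.strip(), int(frequency)))
            except ValueError:
                continue  # skip malformed lines
    bigrams.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in bigrams]

def candidate_pairs(bigrams, first_rank=200, last_rank=1100):
    """Keep only whole word pairs from the given frequency rank range."""
    return [pair for pair in bigrams[first_rank:last_rank] if WORD_PAIR.match(pair)]

# The pairs that survive manual filtering by native speakers are then sent
# to the search engine as phrase queries, e.g. (hypothetical helper):
# for pair in candidate_pairs(load_bigrams('tigrinya_bigrams.txt')):
#     seed_urls.extend(query_search_engine('"%s"' % pair))
}}}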

|| Language || Filtered word bigrams || Seed URLs found by the search engine || Most frequent web domains in seed URLs yielding text ||
|| Amharic || 354 || 6,905 || am.wikipedia.org, www.zehabesha.com, andadirgen.blogspot.com, plus.google.com, wol.jw.org, www.dejeselam.org, ethsat.com ||
|| Oromo || 366 || 9,843 || www.voaafaanoromoo.com, www.youtube.com, finfinnetribune.com, gadaa.com, www.bilisummaa.com, www.oromoliberationfront.org, qeerroo.org ||
|| Somali || 432 || 18,087 || www.youtube.com, salaanmedia.com, waajid.wordpress.com, so.wikipedia.org, www.ogadennet.com, www.bbc.com, geeska.net ||
|| Tigrinya || 424 || 9,007 || assenna.com, www.youtube.com, tigrigna.voanews.com, www.jeberti.com, www.betezion.com, demhitonline.blogspot.com, www.gereger.com ||

Boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language and make corpus searches provide no useful evidence about the phenomenon being investigated.

= HaBiT Corpus Creation Pipeline =
The data retrieval software and text cleaning tools suitable for use within the corpus creation pipeline in this project are:
1. WebBootCaT, a search engine querying and corpus building tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine and downloads the documents.
1. !SpiderLing, a web crawler for linguistic purposes. It crawls text-rich parts of the web and collects large amounts of data suitable for text corpora.
1. Chared, a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it is more accurate than character encoding detection algorithms with no language constraints.
1. Language identification using a wordlist and a character trigram model.
1. Justext, a tool for removing boilerplate content, such as navigation links, headers and footers, from HTML pages (a usage example is given below). It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate the content from the boilerplate.
1. Onion, a tool for removing duplicate parts from large collections of texts. One instance of each duplicate part is kept; the others are marked or removed. The tool removes both identical and similar documents, on any level (document, paragraph or sentence, if present in the data). The de-duplication algorithm is based on comparing hashes of n-grams of words.
1. Unitok, a universal text tokeniser with specific settings for many languages. It turns plain text into a sequence of newline-separated tokens (vertical format), while preserving XML-like tags containing metadata.

Notes: The models used in the process were trained on subsets of documents found by the search engine in the target languages. Malformed data (e.g. misspelled words) is not dealt with.
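
As an illustration of the boilerplate removal step, the sketch below applies the Justext Python package to a single downloaded page. The URL is a placeholder and the English stop word list is only an example; for the target languages an appropriate stop word list (or Justext's language-independent mode) would be needed.

{{{#!python
import urllib.request
import justext

# Download one page and keep only the paragraphs Justext does not mark as boilerplate.
html = urllib.request.urlopen('https://am.wikipedia.org/').read()
paragraphs = justext.justext(html, justext.get_stoplist('English'))
clean_text = '\n'.join(p.text for p in paragraphs if not p.is_boilerplate)
print(clean_text)
}}}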

= Obtaining Low Resourced Language Resources =
Since the African languages dealt with in the project are low resourced and not much represented on the web, the corpus creation pipeline has to be adapted accordingly.

The crawler will be modified to improve the yield of the process for low resourced languages:
* The crawler will be constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in the seed URLs – will be allowed to improve the recall.
* The traversing strategy will allow continuous downloading from web domains yielding only a small amount of text.
* The scheduler will not limit the number of documents downloaded from a single web domain.
* The crawler will gather documents from all target national domains in a single run. The documents will be split into paragraphs and every paragraph will be classified according to its language (as sketched below). This way, documents containing paragraphs in various languages are used in the optimal way.
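
The sketch below shows one simple way such paragraph-level language separation could be implemented, using wordlist coverage scoring. The wordlist files, the threshold and the helper names are assumptions made for the example; the actual pipeline relies on the wordlist and character trigram models mentioned earlier.

{{{#!python
def load_wordlists(paths_by_language):
    """paths_by_language: e.g. {'amharic': 'am_words.txt', 'oromo': 'om_words.txt'}."""
    wordlists = {}
    for language, path in paths_by_language.items():
        with open(path, encoding='utf-8') as source:
            wordlists[language] = set(word.strip().lower() for word in source)
    return wordlists

def guess_language(paragraph, wordlists, threshold=0.3):
    """Return the language whose wordlist covers the paragraph best, or None."""
    words = paragraph.lower().split()
    if not words:
        return None
    best_language, best_score = None, 0.0
    for language, wordlist in wordlists.items():
        score = sum(1 for word in words if word in wordlist) / len(words)
        if score > best_score:
            best_language, best_score = language, score
    return best_language if best_score >= threshold else None

def split_document_by_language(document, wordlists):
    """Route each paragraph of a crawled document to its detected language."""
    routed = {}
    for paragraph in document.split('\n\n'):
        language = guess_language(paragraph, wordlists)
        if language is not None:
            routed.setdefault(language, []).append(paragraph)
    return routed
}}}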

The models used in the process can be trained on texts from Wikipedia or on subsets of documents found by the search engine for all six target languages. Two factors make discerning the target African languages harder and thus make the corpus creation pipeline less accurate:
1. Some languages are spoken in several countries (i.e. across several top level domains).
1. Multiple languages are spoken within a single country.
To mitigate the risk of determining the language of a text incorrectly, it is important to carefully check the data used for building the language dependent models.

Web domains rich in text documents are worth analysing with respect to the structure of their content, since that might increase the amount of harvested data. For example, a web site created using a content management system might offer a site map containing the URLs of all documents within the site, or there can be a sequence of numbers assigned to all documents on the site. In such cases, one can develop a script tailored for downloading from the particular web site, reaching a higher efficiency than the level achieved by a general web crawler. That would lead to larger corpora for languages with a scarce presence on the internet.
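
For the site map case, a tailored download script can be as simple as the sketch below, which reads a standard sitemap.xml and returns all listed document URLs. The sitemap URL is a placeholder; real sites may use sitemap index files or other conventions.

{{{#!python
import urllib.request
import xml.etree.ElementTree as ElementTree

SITEMAP_NAMESPACE = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def urls_from_sitemap(sitemap_url):
    """Return all <loc> URLs listed in a standard sitemap.xml."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        tree = ElementTree.parse(response)
    return [element.text.strip()
            for element in tree.iter(SITEMAP_NAMESPACE + 'loc')
            if element.text]

# Example (placeholder URL):
# for url in urls_from_sitemap('http://www.example.et/sitemap.xml'):
#     download(url)
}}}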