Changes between Version 4 and Version 5 of CorporaAndCorpusBuilding
Timestamp: Jan 27, 2016, 6:08:13 PM
CorporaAndCorpusBuilding
|| Tigrinya || Eritrea, Ethiopia || et, er || Ge’ez script ||

Unlike Norwegian and Czech, the African languages in this set can be considered low resourced – there are no large corpora available. Also, the presence of texts in Amharic, Oromo and Tigrinya on the web is quite limited.

= Obtaining Text Data from the Web =

…

The crawling strategy – making decisions about which parts of the web to explore first, i.e. which documents to download immediately and which to postpone for later – is a very important factor in the design of a successful crawler. The implemented traversing algorithm is crucial for achieving wide coverage of web domains, higher crawling efficiency (a higher amount of extracted data) or catching 'important' web pages, whichever is the priority. SpiderLing, the crawler enhanced and used within this project, focuses on downloading documents from web domains yielding much text in the target languages.

Additional issues have to be taken into account when crawling the web: not overusing the source servers by obeying the Robots exclusion protocol, boilerplate removal and content de-duplication (if desired), and robust postprocessing of crawled data (e.g. language detection, character encoding detection, dealing with malformed data if needed).

= Obtaining Seed URLs =
Starting the crawl with good, text-yielding and trustworthy sources benefits the resulting corpus. Exploiting web search engines is a way to identify relevant documents on the web. The search engine is expected to supply good text data in the desired language based on the search parameters.

WebBootCaT is a tool for bootstrapping corpora and terms from the web (an extension of a method devised by Baroni, nowadays embedded as a module in the SketchEngine corpus query system). It allows quick and effortless focused web corpus building. A similar approach on a much larger scale was used later by the ClueWeb project, which started with two types of seed URLs: one set from an earlier 200 million page crawl, another given by commercial search engines. The search engines were queried using the most frequent queries and random word queries for each target language.
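To illustrate the seed URL bootstrapping step, the following is a minimal sketch of BootCaT-style query generation: random tuples are drawn from a list of frequent words in the target language and each tuple becomes one search engine query. The seed word list and the number of queries are placeholder assumptions; the actual submission of the queries to a search engine (e.g. Bing) is not shown.

{{{#!python
# Sketch: generate random word-tuple queries for seed URL collection.
# The seed word list below is an illustrative placeholder, not part of
# the actual HaBiT tooling.
import random

def generate_queries(seed_words, n_queries=20, tuple_size=3):
    """Draw random word tuples to be used as search engine queries."""
    queries = []
    for _ in range(n_queries):
        tuple_words = random.sample(seed_words, tuple_size)
        queries.append(" ".join(tuple_words))
    return queries

# Placeholder seed words; in practice these would be frequent words
# of the target language (e.g. Amharic, Oromo, Tigrinya or Somali).
seed_words = ["word1", "word2", "word3", "word4", "word5", "word6"]
for query in generate_queries(seed_words, n_queries=5):
    print(query)  # each line would be submitted to the search engine
}}}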
We took the same approach: previous crawls (in the case of Norwegian and Czech) as well as URLs suggested by the search engine Bing were used to initialise the crawler. As for the African languages, the process follows:

…

In the first phase, the document language and character encoding have to be identified. In the case of a single target language, data in other languages are stripped off. The document encoding can be normalized to UTF-8, which is the most widespread encoding standard capable of representing all necessary character codepoints.

Boilerplate consists of parts repeated on many web pages within a web site, e.g. headers, navigation menus, footers, advertisements, lists of links, decorative elements and snippets of related articles. Boilerplate text distorts the statistics of the corpus (e.g. the distributional similarity information): the counts of repeated terms give biased information about the language and make corpus searches provide no useful evidence about the phenomenon being investigated. Therefore it is necessary to remove boilerplate from the corpus data.

Another important issue is duplicity. Digital information on the web is easily copied and thus documents may overlap significantly. That happens, for example, when multiple online newspapers share the same announcement released by a press agency, with document revisions, or with quotations of previous posts in online discussion forums. Removing duplicate and near-duplicate texts is therefore essential to avoid e.g. duplicate concordance lines and to prevent biased results derived from statistical processing of corpus data with artificially inflated frequencies of some words and expressions. We remove paragraphs in which more than 50 % of the word 7-tuples have been encountered in previously processed data.

Filtering out paragraphs in a language similar (but not identical) to the desired language of a document might be necessary in the case of Norwegian (similar to Danish) and Czech (similar to Slovak). It can be achieved by counting words common in the target language but not present in the similar language (accepted words) and vice versa (rejected words). Only paragraphs containing many more accepted words than rejected words should be kept.
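The similar-language check described above can be sketched as follows. The short word lists and the acceptance ratio are assumptions made for illustration; in practice the accepted and rejected word lists would be derived from frequency data of the two languages.

{{{#!python
# Sketch of the accepted/rejected word heuristic for discriminating
# a target language (here Czech) from a similar one (here Slovak).
# The word lists and the ratio threshold are illustrative assumptions.

ACCEPTED = {"jsem", "ještě", "které", "být"}   # common in Czech, absent in Slovak
REJECTED = {"som", "ešte", "ktoré", "byť"}     # common in Slovak, absent in Czech

def keep_paragraph(paragraph, min_ratio=3.0):
    """Keep a paragraph only if accepted words clearly outnumber rejected ones."""
    words = paragraph.lower().split()
    accepted = sum(1 for w in words if w in ACCEPTED)
    rejected = sum(1 for w in words if w in REJECTED)
    if rejected == 0:
        # no evidence of the similar language; keep if any accepted word occurs
        return accepted > 0
    return accepted / rejected >= min_ratio
}}}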
Wrong spelling of words and other kinds of malformed data are not dealt with in this project.

= HaBiT Corpus Creation Pipeline =
The current data retrieval software and text cleaning tools suitable for use within the corpus creation pipeline in this project are:
 1. WebBootCaT, a search engine querying and corpus building tool. It queries a search engine with user-specified words or phrases, obtains the URLs of relevant documents found by the search engine and downloads the documents.
 1. SpiderLing, a web spider for linguistics. It can crawl text-rich parts of the web and collect a large amount of data suitable for text corpora.
 1. Chared, a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
 1. jusText, a tool for removing boilerplate content, such as navigation links, headers and footers, from HTML pages. It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as web corpora. An HTML page is split into paragraphs and a context-sensitive heuristic algorithm is employed to separate content from boilerplate (see the usage sketch after this list).
 1. Onion, a tool for removing duplicate parts from large collections of texts. One instance of the duplicate parts is kept, the others are marked or removed. The tool allows removing both identical and similar parts, on any level (document, paragraph, sentence – if present in the data). The de-duplication algorithm is based on comparing hashes of n-grams of words.
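As a usage sketch for the jusText step of the pipeline, the snippet below downloads a single page and keeps only the paragraphs classified as good content. The URL and the choice of the Czech stoplist are placeholder assumptions; Chared and Onion are typically run as separate steps on the resulting text and are not shown here.

{{{#!python
# Minimal boilerplate-removal sketch using the jusText Python package.
# The URL and the stoplist choice (Czech) are illustrative assumptions.
import requests
import justext

response = requests.get("http://example.org/some-article.html")
paragraphs = justext.justext(response.content, justext.get_stoplist("Czech"))

for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        # keep only paragraphs classified as good content
        print(paragraph.text)
}}}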
…

Since the African languages dealt with in the project are low resourced and not much represented on the web, the corpus creation pipeline has to be adapted accordingly.

The crawler will be modified to improve the yield of the process for low resourced languages:
 * The crawler will be constrained to downloading from the top level domains et, er, so and dj to improve the efficiency of the process. The generic domains com, org, net, gov, info and edu – frequent in seed URLs – will be allowed to improve the recall.
 * The traversing strategy will allow continuous downloading from web domains yielding a small amount of text.
 * The scheduler will not limit the number of documents downloaded from a single web domain.
 * The crawler will gather documents from all target national domains in a single run. The documents will be split into paragraphs and every paragraph classified according to its language. This way, documents containing paragraphs in various languages can be used in the optimal way.

The models used in the process will be trained on texts from Wikipedia or on subsets of documents found by the search engine for all six target languages. There are two factors that complicate discerning the target African languages and thus make the corpus creation pipeline less accurate:
 1. Some languages are spoken in several countries (top level domains).
 2. There are multiple languages spoken in a country.
To mitigate the risk of determining the language of a text incorrectly, it is important to carefully check the data used for building the language dependent models.

For web domains rich in text documents, it is worth analysing the structure of their content, since that might increase the amount of harvested data. For example, a web site created using a content management system might offer a site map containing the URLs of all documents within the site, or there can be a sequence of numbers assigned to all documents on the site. In such cases, one can develop a script tailored for downloading from the particular web site, reaching a higher efficiency than the level achieved by a general web crawler. An analysis will be carried out to identify web domains allowing such a semi-automated approach to obtaining data after the web crawl is done. That will lead to higher corpora sizes for languages with a scarce presence on the internet.
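As an illustration of such a tailored script, the sketch below walks a sequence of numeric document IDs on one site and stores the pages that exist. The URL pattern, the ID range and the output directory are hypothetical assumptions made for the example; a real script would be adapted to the structure actually found on the analysed domain.

{{{#!python
# Sketch of a site-specific downloader for a domain whose documents
# are reachable through sequential numeric IDs. The URL pattern and
# the ID range are hypothetical and would be adapted per site.
import pathlib
import time
import requests

BASE_URL = "http://news.example.et/article/{doc_id}"  # hypothetical URL pattern
OUT_DIR = pathlib.Path("downloaded_pages")
OUT_DIR.mkdir(exist_ok=True)

for doc_id in range(1, 501):  # assumed ID range for the example
    url = BASE_URL.format(doc_id=doc_id)
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        continue  # skip unreachable documents
    if response.status_code == 200:
        (OUT_DIR / f"{doc_id}.html").write_bytes(response.content)
    time.sleep(1)  # be polite to the source server
}}}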