
D2.1: An improvement of web crawler SpiderLing

SpiderLing — a web spider for linguistics

SpiderLing — a web spider for linguistics — is software for obtaining text from the web that is useful for building text corpora. [1] Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, unrestricted web crawls typically download a lot of data which gets filtered out during post-processing, making the process of web corpus collection inefficient. The crawler therefore focuses on text-rich parts of the web to maximize the number of words in the final corpus per downloaded megabyte. It can also be configured to ignore the yield rate of web domains and download from low-yield sites too.
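
To illustrate the yield-rate idea, the following minimal Python sketch shows how a crawler might decide whether to keep downloading from a domain based on the amount of clean text obtained per downloaded byte. It is an illustration only; the class, function names and threshold values are assumptions, not SpiderLing's actual internals.

    # Minimal sketch of yield-rate based scheduling (illustrative only;
    # names and thresholds are assumptions, not SpiderLing internals).

    class DomainStats:
        def __init__(self):
            self.bytes_downloaded = 0   # raw HTTP payload size
            self.words_extracted = 0    # words surviving boilerplate removal

        def yield_rate(self):
            """Words of clean text obtained per downloaded byte."""
            if self.bytes_downloaded == 0:
                return None  # nothing downloaded yet, no evidence either way
            return self.words_extracted / self.bytes_downloaded

    def keep_crawling(stats, min_rate=0.001, min_bytes=1_000_000):
        """Keep a domain in the queue until enough data has been seen;
        afterwards require a minimal words-per-byte ratio (this check can
        be disabled to crawl low-yield domains too, as mentioned above)."""
        if stats.bytes_downloaded < min_bytes:
            return True
        return stats.yield_rate() >= min_rate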

SpiderLing improvements relevant for the HaBiT project

The most relevant improvement of the crawler for the HaBiT project is crawling multiple languages at once. The crawler now supports recognising multiple languages: a web page is converted to UTF-8 and compared to character trigrams characteristic of each recognised language. The similarity of the trigram vectors is calculated to determine the nearest language, and the most similar of the target languages (a subset of the recognised languages specified by the user) that meets a similarity threshold is selected.
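
The following Python sketch illustrates the trigram-based detection described above. It is a simplified stand-in, not SpiderLing's code: the cosine similarity measure, the function names and the 0.4 threshold are assumptions.

    # Illustrative sketch of character-trigram language identification.
    import math
    from collections import Counter

    def trigram_profile(text):
        """Relative frequencies of character trigrams in a UTF-8 string."""
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        total = sum(counts.values()) or 1
        return {t: c / total for t, c in counts.items()}

    def cosine(p, q):
        """Cosine similarity of two trigram frequency vectors."""
        dot = sum(p[t] * q[t] for t in set(p) & set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    def detect_language(page_text, models, target_langs, threshold=0.4):
        """Return the most similar target language, or None if no model
        from the user-selected subset passes the similarity threshold."""
        page = trigram_profile(page_text)
        best_lang, best_sim = None, threshold
        for lang in target_langs:
            sim = cosine(page, models[lang])
            if sim >= best_sim:
                best_lang, best_sim = lang, sim
        return best_lang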

The following language models are required for each language to be recognised:

  • Character trigram model for language detection. The model is built by the crawler from clean texts supplied by the user. The texts may come from online newspapers or Wikipedia. All non-words and foreign words should be removed to increase the accuracy of the resulting model (a minimal sketch of building these resources follows this list).
  • Byte trigram model for character encoding detection (in case encodings other than UTF-8 are used on the web for the given language). The model can be trained on sample web pages in various encodings. The Corpus Factory method [2] can be used to obtain the web pages.
  • Wordlist of frequent words. At least the 250 most frequent words in the language are required. The texts used to build the trigram model can be reused here.
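
A minimal sketch of preparing the trigram model and the wordlist from clean text is given below. It is illustrative only; SpiderLing's own model-building scripts and file formats may differ.

    # Sketch of preparing per-language resources from clean text
    # (illustrative; names and the cut-offs are assumptions).
    from collections import Counter

    def build_trigram_model(clean_text, top_n=3000):
        """Character trigram frequencies from clean, deduplicated text
        (e.g. newspaper or Wikipedia articles, foreign words removed)."""
        counts = Counter(clean_text[i:i + 3]
                         for i in range(len(clean_text) - 2))
        total = sum(counts.values()) or 1
        return {t: c / total for t, c in counts.most_common(top_n)}

    def build_wordlist(clean_text, top_n=250):
        """The top_n most frequent words -- at least 250 are needed."""
        words = Counter(clean_text.lower().split())
        return [w for w, _ in words.most_common(top_n)]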

This technique is useful for crawling web domains in countries or larger areas where multiple languages are used. For example, all four languages of interest in the HaBiT project are spoken in Ethiopia: Oromo (34 % of the population), Amharic (29 %), Somali (6 %) and Tigrinya (6 %). [3] A similar situation may be observed e.g. in Nigeria: Hausa and Fulani (29 %), Yoruba (21 %), Igbo (18 %). [4] This improvement was therefore quite beneficial for the HaBiT project.

The improved crawler was employed in January 2016 to harvest web domains in the national top level domains of Ethiopia, Eritrea, Somalia and Djibouti (et, er, so, dj) as well as in general TLDs (com, org, net, gov, info, edu). 42 GB of HTTP responses were gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data, and the plain texts were separated by target language using the language models.

Another improvement of the crawler relevant for the HaBiT project is tracking the distance of web domains from the seed web domains. A text corpus built for computational linguistics purposes should contain fluent, meaningful, natural sentences in the desired language. However, some websites created for spamming purposes (e.g. to increase a site's search engine rank) do not have these properties: they introduce non-words and incoherent sentences into the corpus, hindering its quality and leading to skewed language analyses based on it.

Web topology oriented techniques perceive the web as a graph of web pages interconnected by links. Castillo et al. observed that "linked hosts tend to belong to the same class: either both are spam or both are non-spam". [5] Another work proposes TrustRank [6], a web page assessment algorithm based on trusted sources manually identified by experts. The set of such reputable pages was used as seeds for a web crawl, and the link structure of the web helped to discover other pages that are likely to be good.

The improvement of SpiderLing is similar to the TrustRank method. First, the seed URLs are taken from trusted and relevant sources: lists of web newspapers, government portals, and sites of large institutions. These URLs are taken from specialised lists, e.g. http://www.onlinenewspapers.com/, http://www.dmoz.org/, http://urlblacklist.com/, and from manually selected URLs from Wikipedia in the target language. All web domains in the seed URLs have zero distance. A new domain discovered in the process of crawling is assigned a distance greater by one than the distance of the domain linking to the new domain. The distances are updated during the crawling. Finally, the resulting corpus can be built only from domains whose distance from the trusted domains does not exceed a certain threshold. This should lead to a better quality of the resulting corpus.
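
The distance tracking can be illustrated by the following Python sketch. The data structures and names are assumptions for illustration, not SpiderLing internals: a seed domain gets distance 0, a newly discovered domain gets the linking domain's distance plus one, and shorter paths found later lower the stored value.

    # Sketch of tracking domain distance from trusted seed domains.
    from urllib.parse import urlparse

    distance = {}  # web domain -> shortest known distance from a seed

    def add_seeds(seed_urls):
        for url in seed_urls:
            distance[urlparse(url).netloc] = 0  # trusted seeds: distance 0

    def record_link(source_url, target_url):
        """Update distances when a link from source_url to target_url
        is seen during crawling."""
        src = urlparse(source_url).netloc
        dst = urlparse(target_url).netloc
        if src not in distance or src == dst:
            return
        candidate = distance[src] + 1
        if dst not in distance or candidate < distance[dst]:
            distance[dst] = candidate

    def corpus_domains(max_distance=3):
        """Domains eligible for the final corpus, up to a threshold."""
        return {d for d, dist in distance.items() if dist <= max_distance}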

Using this technique is advisable for languages with a large web presence (not for scarce-resourced languages), since it may omit useful texts just because of a large distance of their domain from the seeds.

Overview of major improvements of SpiderLing made during the HaBiT project

  • Crawling multiple languages: recognise multiple languages, accept a subset of these languages.
  • Tracking the distance of web domains from the seed web domains.
  • Gathering more document attributes: IP address, detected language, declared/detected character encoding.
  • More data extracted thanks to tweaking the jusText configuration and to using clean rather than merely large text sources for the language models.
  • Support for processing doc/docx/ps/pdf formats of data.
  • Strict resource locking to prevent multithreading issues when operating on shared data structures (see the sketch after this list).
  • Better performance (more pages downloaded per second) and lower resource usage (approx. 25 % less operational memory consumed), achieved by better spreading of domains in the crawling queue, switching from Python to PyPy (the script is compiled before execution instead of being interpreted during execution), rewriting the chunked HTTP response and URL handling methods, and general code improvements.
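
As an illustration of the strict resource locking mentioned above, the sketch below guards a shared URL queue with a single lock. It is a minimal example under assumed names, not the actual SpiderLing data structures.

    # Minimal sketch of strict locking around shared crawler state.
    import threading
    from collections import deque

    class SharedCrawlState:
        """Shared crawler state guarded by a single lock (illustrative)."""
        def __init__(self):
            self._lock = threading.Lock()
            self._seen_urls = set()
            self._queue = deque()

        def schedule(self, url):
            # The lock ensures two threads cannot schedule the same URL
            # twice or corrupt the queue by modifying it concurrently.
            with self._lock:
                if url not in self._seen_urls:
                    self._seen_urls.add(url)
                    self._queue.append(url)

        def next_url(self):
            with self._lock:
                return self._queue.popleft() if self._queue else None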

SpiderLing crawler documentation and source

http://corpus.tools/wiki/SpiderLing

References

  • [1] Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
  • [2] Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
  • [3] https://www.cia.gov/library/publications/the-world-factbook/geos/et.html
  • [4] https://www.cia.gov/library/publications/the-world-factbook/geos/ni.html
  • [5] Castillo, Carlos, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. "Know your neighbors: Web spam detection using the web topology." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 423-430. ACM, 2007.
  • [6] Gyöngyi, Zoltán, Hector Garcia-Molina, and Jan Pedersen. "Combating web spam with trustrank." In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pp. 576-587. VLDB Endowment, 2004.