
D2.3: A new Czech corpus, sized 10 billion words

Building the Czech Web corpus

We used the SpiderLing crawler [1] to crawl Czech sites on the Web. URLs of documents obtained in previous years with the Corpus Factory method [2] served as starting points (seeds) for the crawler.

The following language models were created:

  • A character trigram model for language detection, trained on 59 KB of manually checked text from documents obtained in previous years.
  • A byte trigram model for character encoding detection, trained on web pages obtained by the Corpus Factory method.
  • A wordlist of the 2,659 most frequent Czech words from the texts obtained in previous years, used by the boilerplate removal tool jusText [3] to check the language of running text.
  • The same models were also created for Slovak and English so that content in unwanted languages could be filtered out of the corpus data while crawling the Web.
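
As an illustration of the first item in the list above, a character trigram language detector can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in SpiderLing: the function names, the cosine similarity scoring, and the tiny training strings are all hypothetical choices made for the example (the real model was trained on 59 KB of checked text).

```python
from collections import Counter
import math


def char_trigrams(text):
    """Return the list of character trigrams of a whitespace-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_trigram_model(text):
    """Build a relative-frequency table of character trigrams."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def similarity(model, text):
    """Cosine similarity between a trained model and the trigrams of a text.

    Cosine similarity is an assumption of this sketch; other scoring
    functions (e.g. log-likelihood) are equally plausible.
    """
    sample = train_trigram_model(text)
    dot = sum(p * model.get(gram, 0.0) for gram, p in sample.items())
    norm = (math.sqrt(sum(p * p for p in model.values()))
            * math.sqrt(sum(p * p for p in sample.values())))
    return dot / norm if norm else 0.0


def detect_language(models, text):
    """Return the key of the model most similar to the text."""
    return max(models, key=lambda lang: similarity(models[lang], text))


# Toy training data, invented for the example only.
models = {
    "cs": train_trigram_model(
        "Příliš žluťoučký kůň úpěl ďábelské ódy. "
        "Toto je krátký český text o jazyce a korpusu."),
    "en": train_trigram_model(
        "The quick brown fox jumps over the lazy dog. "
        "This is a short English text about language and corpora."),
}
```

With models for Czech, Slovak and English trained this way, a crawler could call `detect_language(models, page_text)` on each downloaded page and discard pages whose best match is not Czech.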

The crawler was set to harvest web domains in the national top-level domain of the Czech Republic (cz) and in other general TLDs (eu, com, org, net, gov, info, edu).
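
The TLD restriction described above can be sketched as a simple URL filter. The names `ALLOWED_TLDS` and `is_allowed` are hypothetical and do not come from SpiderLing's configuration; only the list of TLDs is taken from the text.

```python
from urllib.parse import urlsplit

# TLDs listed in the text above.
ALLOWED_TLDS = {"cz", "eu", "com", "org", "net", "gov", "info", "edu"}


def is_allowed(url):
    """Return True if the URL's host name ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if missing
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in ALLOWED_TLDS
```

A filter of this kind only limits where the crawler goes; it does not guarantee Czech content, which is why the language models are still needed for pages in the general TLDs.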

In total, 2,117 GB of HTTP responses were gathered. jusText removed HTML tags and boilerplate paragraphs from the raw data. Czech texts obtained by similar means in 2015 were then added. Duplicate and near-duplicate paragraphs were identified and removed with the onion tool [3]. The final corpus is 80 GB in size and contains 9.3 billion tokens.
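
The paragraph-level deduplication step can be illustrated with a simplified word n-gram overlap filter. This is a sketch in the spirit of n-gram based deduplication, not onion's implementation: the function names, the n-gram length of 5 and the threshold of 0.5 are assumptions chosen for the example.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams; short inputs yield one shorter gram."""
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph only if at most `threshold` of its n-grams were seen.

    Paragraphs are processed in order, so the first occurrence of any
    repeated content is the one that survives.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph.split(), n)
        if not grams:
            continue  # empty paragraph, nothing to keep
        duplicated = len(grams & seen) / len(grams)
        if duplicated > threshold:
            continue  # mostly made of already seen text
        kept.append(paragraph)
        seen |= grams
    return kept
```

For a corpus of this size the set of seen n-grams would not fit in memory as Python tuples; a practical implementation would store hashes of the n-grams instead.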

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count 32,739,545
Paragraph count 185,942,419
Sentence count 591,149,939
Token count 9,307,153,731
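
Average structure sizes follow directly from the counts above. The short script below derives them; the dictionary keys and the helper name `ratio` are chosen for the example.

```python
# Structure counts as reported above.
counts = {
    "documents": 32_739_545,
    "paragraphs": 185_942_419,
    "sentences": 591_149_939,
    "tokens": 9_307_153_731,
}


def ratio(numerator, denominator):
    """Average number of `numerator` structures per `denominator` structure."""
    return counts[numerator] / counts[denominator]


for num, den in [("tokens", "documents"), ("paragraphs", "documents"),
                 ("sentences", "paragraphs"), ("tokens", "sentences")]:
    print(f"{num} per {den[:-1]}: {ratio(num, den):.1f}")
```

This gives roughly 284 tokens per document and about 16 tokens per sentence.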

Corpus query interface

The corpus has been indexed with the Sketch Engine corpus manager and query system [4]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=cztenten16_0.

References

  • [1] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
  • [2] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
  • [3] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." PhD thesis, Masaryk University, Faculty of Informatics (2011).
  • [4] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
Last modified on Jan 18, 2017, 10:56:06 AM