D2.2: A Norwegian corpus, sized 5 billion words

Building the Norwegian Web corpus

We used the SpiderLing crawler [1] to crawl Norwegian sites on the Web. URLs of documents obtained in previous years with the Corpus Factory method [2] served as starting points for the crawler.

The following language models and word list were created:

A character trigram model for language detection, trained on 129 KB of manually checked text from documents obtained in previous years.
A byte trigram model for character encoding detection, trained on web pages obtained by the Corpus Factory method.
A wordlist of the 1,756 most frequent Norwegian (Bokmål) words from the texts obtained in previous years, used by the boilerplate removal tool jusText [3] to check the language of running text.
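
The idea behind a character trigram model for language detection can be illustrated with a minimal sketch. It is not the model used in this project: the function names, the tiny training sample and the cosine similarity scoring are assumptions chosen for brevity.

```python
from collections import Counter
import math


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, whitespace-normalised text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical training sample; a real model would use far more text.
SAMPLE = "det er en fin dag i dag og jeg har ikke tid til å gå på skolen som ligger i byen"
MODEL = trigram_profile(SAMPLE)

if __name__ == "__main__":
    for candidate in ("jeg har en fin dag og det er ikke langt til byen",
                      "the quick brown fox jumps over the lazy dog"):
        print(round(cosine_similarity(trigram_profile(candidate), MODEL), 3), candidate)
```

A document would be accepted when its score exceeds a threshold tuned on held-out data. The same approach applies to byte trigrams for encoding detection if the profile is built over raw bytes instead of characters.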

The crawler was set to harvest web domains in the national top-level domain of Norway (no) and in other general TLDs (eu, com, org, net, gov, info, edu).
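
A TLD restriction of this kind can be sketched as a simple URL filter. The snippet below is only an illustration using the Python standard library; `ALLOWED_TLDS` and `tld_allowed` are hypothetical names, and SpiderLing's actual configuration mechanism is not shown here.

```python
from urllib.parse import urlsplit

# TLDs listed in the text above.
ALLOWED_TLDS = {"no", "eu", "com", "org", "net", "gov", "info", "edu"}


def tld_allowed(url: str) -> bool:
    """Return True if the URL's host name ends in one of the allowed TLDs."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if no host is present
    if not host or "." not in host:
        return False
    return host.rstrip(".").rsplit(".", 1)[-1] in ALLOWED_TLDS


if __name__ == "__main__":
    for url in ("http://www.uio.no/", "https://example.se/", "https://blogspot.com/x"):
        print(url, tld_allowed(url))
```

Note that a filter like this only restricts where the crawler goes. Pages on com or org domains still need the language check to decide whether their text is Norwegian.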

487 GB of HTTP responses were gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data by jusText. Norwegian texts obtained by similar means in 2015 and 2011 were then added. Duplicate and near-duplicate paragraphs were identified and removed using the onion tool [3]. The final corpus is 29 GB in size and contains about 4 billion tokens (3,986,246,932).
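
Paragraph-level near-duplicate removal can be sketched as follows. This is a simplified illustration of n-gram overlap deduplication, not onion's implementation: the function names, the n-gram length `n=5` and the `threshold=0.5` are assumptions made for the example.

```python
def ngrams(tokens: list, n: int) -> set:
    """Return the set of token n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs: list, n: int = 5, threshold: float = 0.5) -> list:
    """Keep a paragraph unless more than `threshold` of its n-grams were already seen."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        tokens = paragraph.lower().split()
        grams = ngrams(tokens, n)
        if not grams:
            # Paragraph shorter than n tokens: compare it as a single unit.
            grams = {tuple(tokens)}
        duplicated = sum(1 for gram in grams if gram in seen) / len(grams)
        if duplicated > threshold:
            continue  # near duplicate of earlier text: drop it
        kept.append(paragraph)
        seen.update(grams)
    return kept


if __name__ == "__main__":
    sample = [
        "a b c d e f g h",
        "a b c d e f g h",   # exact duplicate
        "a b c d e f g x",   # near duplicate: 3 of 4 five-grams already seen
        "completely different words appear in this one",
    ]
    print(deduplicate(sample))
```

On a corpus of this size the set of seen n-grams would not fit comfortably in memory as Python tuples; hashing the n-grams to fixed-width integers is one common way to reduce the footprint.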

Corpus properties

Basic properties of corpus sources are summarised below.

The size of corpus structures:

Document count	14,476,061
Paragraph count	102,141,296
Token count	3,986,246,932
Word form lexicon size	24,167,191
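
A few averages follow directly from these counts. The short calculation below derives them from the table; the variable names are illustrative only.

```python
# Counts copied from the table above.
DOCUMENTS = 14_476_061
PARAGRAPHS = 102_141_296
TOKENS = 3_986_246_932
LEXICON = 24_167_191

tokens_per_document = TOKENS / DOCUMENTS
paragraphs_per_document = PARAGRAPHS / DOCUMENTS
tokens_per_paragraph = TOKENS / PARAGRAPHS
tokens_per_word_form = TOKENS / LEXICON

if __name__ == "__main__":
    print(f"tokens per document:     {tokens_per_document:.1f}")
    print(f"paragraphs per document: {paragraphs_per_document:.2f}")
    print(f"tokens per paragraph:    {tokens_per_paragraph:.1f}")
    print(f"tokens per word form:    {tokens_per_word_form:.1f}")
```

An average document therefore holds roughly 275 tokens in about 7 paragraphs, which is typical of short web pages and blog posts.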

Document count – the most frequent web domains and domain size distribution:

Top level domain	Documents	Web domain	Documents	Second level domain size	Domain count
no	9,127,584	blogg.no	349,462	At least 10,000 documents	153
com	3,668,518	blogspot.no	341,863	At least 1,000 documents	2,776
net	567,790	kommune.no	185,195	At least 100 documents	14,878
org	373,856	co.no	79,645	At least 10 documents	52,196
info	179,650	uio.no	66,466	At least 1 document	118,615
		diplotop.com	61,918
		blogspot.com	51,464
		vgb.no	46,361

The most frequent words:

Word (Latin script)	Count
og	104,481,495
i	87,921,748
er	68,176,093
på	55,775,756
det	54,210,341
som	53,329,594
en	50,226,498
til	48,999,052
å	43,671,334
med	43,099,278
av	43,070,779
for	42,796,768
at	35,979,998
har	34,762,626
de	25,516,989
ikke	24,923,211
den	23,416,684
om	21,264,735
jeg	20,844,747
kan	19,317,571

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [4]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=notenten17_0.

References

[1] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
[2] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
[3] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Doctoral dissertation, Masaryk University, Faculty of Informatics (2011).
[4] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.

Last modified on Jan 18, 2017, 10:50:36 AM
