Changes between Version 1 and Version 2 of NorwegianCorpus


Timestamp:
Jan 18, 2017, 10:40:50 AM (8 years ago)
Author:
xsuchom2
Comment:

Created

  • NorwegianCorpus

    v1 v2  
    11= D2.2: A Norwegian corpus, sized 5 billion words =
     2
     3== Building the Norwegian Web corpus ==
     4We used the crawler !SpiderLing [1] to crawl Norwegian sites on the Web. URLs of documents obtained in previous years with the Corpus Factory method [2] served as starting points for the crawler.
     5
     6The following language models were created:
     7 - Character trigram model for language detection. The model was trained on 129 KB of manually checked text from documents obtained in previous years.
     8 - Byte trigram model for character encoding detection. The model was trained on web pages obtained by the Corpus Factory method.
     9 - A wordlist of the 1756 most frequent Norwegian (Bokmål) words from the texts obtained in previous years, used by the boilerplate removal tool jusText [3] to check the language of running text.
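
As an illustration of the first model in the list above, the following is a minimal sketch of character trigram language detection. It is not the code used by !SpiderLing: the function names, the padding scheme, the cosine similarity scoring and the tiny training strings are all assumptions made for this example.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return all overlapping character trigrams of the lowercased text."""
    padded = f"  {text.lower()} "  # padding so word boundaries form trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_trigram_model(text):
    """Map each trigram to its relative frequency in the training text."""
    counts = Counter(char_trigrams(text))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(model_a, model_b):
    """Cosine similarity of two sparse trigram frequency vectors."""
    dot = sum(w * model_b.get(gram, 0.0) for gram, w in model_a.items())
    norm_a = math.sqrt(sum(w * w for w in model_a.values()))
    norm_b = math.sqrt(sum(w * w for w in model_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_language(text, models):
    """Return the key of the model most similar to the text's trigram profile."""
    sample = train_trigram_model(text)
    return max(models, key=lambda lang: cosine_similarity(sample, models[lang]))


# Hypothetical toy training data; a real model would use far more text.
MODELS = {
    "no": train_trigram_model(
        "jeg har ikke sett det og det er på tide å gjøre noe med det"),
    "en": train_trigram_model(
        "i have not seen it and it is time to do something about that"),
}
```

A real model would be trained on a much larger sample, such as the 129 KB of checked text mentioned above, and would typically cover many competing languages.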
     10
     11The crawler was set to harvest web domains in the Norwegian national top-level domain (no) and in other general TLDs (eu, com, org, net, gov, info, edu).
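
A TLD restriction of this kind can be sketched as a simple URL filter. This is only an illustration under assumptions: the names `ALLOWED_TLDS`, `top_level_domain` and `is_allowed` are invented here, and the actual !SpiderLing configuration is not shown in this page.

```python
from urllib.parse import urlsplit

# TLDs listed in the text above.
ALLOWED_TLDS = {"no", "eu", "com", "org", "net", "gov", "info", "edu"}


def top_level_domain(url):
    """Return the last label of the URL's host name, or None if there is no host."""
    host = urlsplit(url).hostname  # already lowercased by urlsplit
    if not host:
        return None
    return host.rstrip(".").rsplit(".", 1)[-1]


def is_allowed(url):
    """True if the URL's host lies in one of the allowed top-level domains."""
    return top_level_domain(url) in ALLOWED_TLDS
```

With this sketch, `is_allowed("http://www.uio.no/studier/")` is accepted, while a URL under a TLD outside the list is rejected.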
     12
     13487 GB of HTTP responses were gathered in the process. jusText removed HTML tags and boilerplate paragraphs from the raw data. Norwegian texts obtained by similar means in 2015 and 2011 were then added. Duplicate and near-duplicate paragraphs were identified and removed with the tool onion [3]. The final corpus is 29 GB in size and contains 4 billion tokens.
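
The idea behind paragraph-level near-duplicate removal can be sketched with word n-gram overlap. The code below is a simplified illustration, not onion's implementation: the function names, the n-gram length of 5 and the 0.5 threshold are assumptions chosen for the example.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph only if at most `threshold` of its n-grams were seen before.

    n and threshold are illustrative values, not onion's settings.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        tokens = paragraph.lower().split()
        grams = word_ngrams(tokens, n)
        if not grams:
            # Paragraph shorter than n tokens: treat it as a single n-gram.
            grams = {tuple(tokens)}
        duplicated = sum(1 for gram in grams if gram in seen) / len(grams)
        if duplicated <= threshold:
            kept.append(paragraph)
        # Remember the n-grams either way so later copies are also caught.
        seen.update(grams)
    return kept
```

Because the check runs in document order, the first occurrence of a paragraph is kept and later copies or light variations of it are dropped.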
     14
     15== Corpus properties ==
     16Basic properties of corpus sources are summarised below.
     17
     18The size of corpus structures:
     19||=Document count    =||    14,476,061||
     20||=Paragraph count   =||   102,141,296||
     21||=Token count       =|| 3,986,246,932||
     22||=Word form lexicon size =||    24,167,191||
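
The structure counts above allow a few derived averages to be computed. The short sketch below only does arithmetic on the figures from the table; the variable names are chosen for this example.

```python
# Figures copied from the table of corpus structures.
DOCUMENTS = 14_476_061
PARAGRAPHS = 102_141_296
TOKENS = 3_986_246_932
LEXICON_SIZE = 24_167_191

tokens_per_document = TOKENS / DOCUMENTS
paragraphs_per_document = PARAGRAPHS / DOCUMENTS
tokens_per_paragraph = TOKENS / PARAGRAPHS
tokens_per_word_form = TOKENS / LEXICON_SIZE

print(f"tokens per document:     {tokens_per_document:.1f}")
print(f"paragraphs per document: {paragraphs_per_document:.1f}")
print(f"tokens per paragraph:    {tokens_per_paragraph:.1f}")
print(f"tokens per word form:    {tokens_per_word_form:.1f}")
```

An average document therefore holds roughly 275 tokens in about 7 paragraphs.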
     23
     24Document count – the most frequent web domains and domain size distribution:
     25||||= Top level domains =||||= Web domains =||||= Second level domain size distribution =||
     26||no   || 9,127,584|| blogg.no             || 349,462||At least 10,000 documents ||     153||
     27||com  || 3,668,518|| blogspot.no          || 341,863||At least  1,000 documents ||   2,776||
     28||net  ||   567,790|| kommune.no           || 185,195||At least    100 documents ||  14,878||
     29||org  ||   373,856|| co.no                    ||  79,645||At least     10 documents ||  52,196||
     30||info ||   179,650|| uio.no               ||  66,466||At least      1 document  || 118,615||
     31|| ||              || diplotop.com             ||  61,918|| || ||
     32|| ||              || blogspot.com             ||  51,464|| || ||
     33|| ||              || vgb.no               ||  46,361|| || ||
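
The size distribution in the right-hand columns is cumulative: each row counts the domains holding at least the given number of documents. A minimal sketch of that computation follows, assuming per-domain document counts are available as a dictionary. The function name is invented for this example, and the two `.example` entries are hypothetical values added only to make the thresholds visible.

```python
THRESHOLDS = (10_000, 1_000, 100, 10, 1)


def size_distribution(domain_counts, thresholds=THRESHOLDS):
    """For each threshold, count the domains with at least that many documents."""
    return {
        threshold: sum(1 for count in domain_counts.values() if count >= threshold)
        for threshold in thresholds
    }


sample_counts = {
    "blogg.no": 349_462,   # from the table above
    "uio.no": 66_466,      # from the table above
    "vgb.no": 46_361,      # from the table above
    "small.example": 120,  # hypothetical
    "tiny.example": 3,     # hypothetical
}
distribution = size_distribution(sample_counts)
```

Since the counts are cumulative, every domain counted in one row is also counted in all rows below it.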
     34
     35The most frequent words:
     36||=Word (Latin script) =||= Count =||
     37||og    || 104,481,495||
     38||i     || 87,921,748||
     39||er    || 68,176,093||
     40||på    || 55,775,756||
     41||det   || 54,210,341||
     42||som   || 53,329,594||
     43||en    || 50,226,498||
     44||til   || 48,999,052||
     45||å     || 43,671,334||
     46||med   || 43,099,278||
     47||av    || 43,070,779||
     48||for   || 42,796,768||
     49||at    || 35,979,998||
     50||har   || 34,762,626||
     51||de    || 25,516,989||
     52||ikke  || 24,923,211||
     53||den   || 23,416,684||
     54||om    || 21,264,735||
     55||jeg   || 20,844,747||
     56||kan   || 19,317,571||
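
Absolute counts are easier to compare across corpora when normalised to occurrences per million tokens. The sketch below applies that normalisation to the first rows of the table, using the token count given earlier; the names are chosen for this example.

```python
TOKEN_COUNT = 3_986_246_932  # token count of the corpus

# First rows of the word frequency table.
TOP_WORDS = {
    "og": 104_481_495,
    "i": 87_921_748,
    "er": 68_176_093,
}


def per_million(count, total=TOKEN_COUNT):
    """Convert an absolute count to occurrences per million tokens."""
    return count * 1_000_000 / total


for word, count in TOP_WORDS.items():
    print(f"{word}\t{per_million(count):,.0f} per million tokens")
```

By this measure the most frequent word, og, accounts for roughly 2.6% of all tokens.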
     57
     58== Corpus query interface ==
     59The corpus has been indexed by the corpus manager and query system Sketch Engine [4]. It can be searched at http://corpora.fi.muni.cz/habit/run.cgi/first_form?corpname=notenten17_0.
     60
     61== References ==
     62 - [1] -- Suchomel, Vít, and Jan Pomikálek. "Efficient web crawling for large text corpora." In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 39-43. 2012.
     63 - [2] -- Kilgarriff, Adam, Siva Reddy, Jan Pomikálek, and P. V. S. Avinesh. "A Corpus Factory for Many Languages." In LREC. 2010.
     64 - [3] -- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Doctoral thesis, Masaryk University, Faculty of Informatics (2011).
     65 - [4] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.