Changes between Version 3 and Version 4 of NorwegianCorpus


Timestamp:
Jan 18, 2017, 10:50:36 AM
Author:
xsuchom2
Comment:

--

  • NorwegianCorpus

    v3  v4
     9   9   - 1756 most frequent Norwegian (Bokmål) words from the texts obtained in previous years were used as a wordlist to check the language of a running text by boilerplate removal tool jusText [3].
    10  10
    11       The crawler was set to harvest web domains in national top level domain Norway (no) and other general TLDs (eu, com, org, net, gov, info, edu).
        11   The crawler was set to harvest web domains in the national top level domain of Norway (no) and other general TLDs (eu, com, org, net, gov, info, edu).
    12  12
    13  13   487 GB of http responses was gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data by jusText. Then, Norwegian texts obtained by similar means in 2015 and 2011 were added. Duplicate or near duplicate paragraphs were identified and removed using tool onion [3]. The final size of the corpus is 29 GB and 4 billion tokens.