Changes between Version 3 and Version 4 of NorwegianCorpus
- Timestamp: Jan 18, 2017, 10:50:36 AM
NorwegianCorpus
Line 9 (unchanged):
- 1756 most frequent Norwegian (Bokmål) words from the texts obtained in previous years were used as a wordlist to check the language of a running text by boilerplate removal tool jusText [3].

Line 11, v3 (removed):
The crawler was set to harvest web domains in national top level domainNorway (no) and other general TLDs (eu, com, org, net, gov, info, edu).

Line 11, v4 (added):
The crawler was set to harvest web domains in the national top level domain of Norway (no) and other general TLDs (eu, com, org, net, gov, info, edu).

Line 13 (unchanged):
487 GB of http responses was gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data by jusText. Then, Norwegian texts obtained by similar means in 2015 and 2011 were added. Duplicate or near duplicate paragraphs were identified and removed using tool onion [3]. The final size of the corpus is 29 GB and 4 billion tokens.
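
The TLD restriction and the wordlist-based language check described in the page text could be sketched roughly as follows. This is an illustrative sketch only, not the crawler's or jusText's actual code: the function names `in_allowed_tld` and `wordlist_ratio` are hypothetical, and the simple token-ratio score is an assumption standing in for whatever jusText really does with the wordlist.

```python
from urllib.parse import urlsplit

# TLDs listed in the page text: the Norwegian national TLD plus general ones.
ALLOWED_TLDS = {"no", "eu", "com", "org", "net", "gov", "info", "edu"}


def in_allowed_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper; the real crawler configuration is not shown
    in the source.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if missing
    if not host:
        return False
    # Take the last dot-separated label, ignoring a trailing root dot.
    return host.rstrip(".").rsplit(".", 1)[-1] in ALLOWED_TLDS


def wordlist_ratio(text: str, wordlist: set) -> float:
    """Fraction of tokens in `text` that appear in `wordlist`.

    Assumed scoring scheme for illustration: split on whitespace, strip
    surrounding punctuation, lowercase, then count wordlist hits.
    """
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in wordlist for t in tokens) / len(tokens)
```

In such a setup, a paragraph would be kept as Norwegian only if its ratio exceeds some threshold; the threshold value is not stated in the source and would have to be chosen empirically.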