Changes between Version 3 and Version 4 of NorwegianCorpus

Timestamp: Jan 18, 2017, 10:50:36 AM
Author: xsuchom2
Comment: --

NorwegianCorpus

v3	v4
9	9	- The 1756 most frequent Norwegian (Bokmål) words from the texts obtained in previous years were used as a wordlist to check the language of running text by the boilerplate removal tool jusText [3].
10	10
11		The crawler was set to harvest web domains in ~~national top level domain~~ Norway (no) and other general TLDs (eu, com, org, net, gov, info, edu).
	11	The crawler was set to harvest web domains in the national top level domain of Norway (no) and other general TLDs (eu, com, org, net, gov, info, edu).
12	12
13	13	487 GB of HTTP responses were gathered in the process. HTML tags and boilerplate paragraphs were removed from the raw data by jusText. Then, Norwegian texts obtained by similar means in 2015 and 2011 were added. Duplicate or near-duplicate paragraphs were identified and removed using the tool onion [3]. The final size of the corpus is 29 GB and 4 billion tokens.
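
The TLD restriction described in the changed line can be illustrated with a minimal sketch. This is not the crawler's actual configuration, which the page does not show; the function name `in_crawl_scope` and the constant `ALLOWED_TLDS` are hypothetical, and only the list of TLDs is taken from the text above.

```python
from urllib.parse import urlsplit

# TLDs listed in the corpus description: the national TLD of Norway
# plus the general TLDs named alongside it.
ALLOWED_TLDS = {"no", "eu", "com", "org", "net", "gov", "info", "edu"}


def in_crawl_scope(url: str) -> bool:
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if not host:
        return False
    # Take the last dot-separated label, ignoring a trailing root dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in ALLOWED_TLDS
```

A check like this only narrows the set of domains to visit. Per the description, the language of the downloaded text is still verified separately with the wordlist passed to jusText.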