| 7 | Because many subtitles are quite old, the character encoding must be detected automatically; the various source formats are converted into XML, and the text is segmented into sentences and tokens. OCR errors (many subtitles are extracted automatically from video streams) were corrected using the noisy-channel principle, with language models trained on the Google N-grams data. Metadata (information about the films, the date the subtitles were created, etc.) is not preserved in the final parallel corpus. Document-level alignment was done with a heuristic scoring function, while sentence (segment)-level alignment used the time overlap between subtitle intervals. |
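The interval-overlap idea behind the segment-level alignment can be sketched as follows. This is an illustrative minimal version, not the paper's exact algorithm: the function names, the Jaccard-style overlap score, the 0.5 threshold, and the greedy best-match strategy are all assumptions for the sketch.

```python
def overlap_ratio(a, b):
    """Temporal overlap of two (start, end) intervals in seconds,
    normalized by the length of their union (Jaccard-style score)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def align_by_time(src, tgt, threshold=0.5):
    """Pair source/target subtitle segments whose on-screen time spans
    overlap strongly enough.

    src, tgt: lists of (start_sec, end_sec, text) tuples, sorted by start.
    Returns a list of (src_text, tgt_text) pairs.

    Hypothetical sketch: a real pipeline would also correct global timing
    offsets between subtitle files and allow 1-to-many segment links.
    """
    pairs = []
    for s_start, s_end, s_text in src:
        best, best_r = None, threshold
        for t_start, t_end, t_text in tgt:
            r = overlap_ratio((s_start, s_end), (t_start, t_end))
            if r > best_r:
                best, best_r = t_text, r
        if best is not None:
            pairs.append((s_text, best))
    return pairs

src = [(0.0, 2.5, "Hello."), (3.0, 5.0, "How are you?")]
tgt = [(0.1, 2.4, "Bonjour."), (3.1, 5.2, "Comment allez-vous ?")]
print(align_by_time(src, tgt))
# → [('Hello.', 'Bonjour.'), ('How are you?', 'Comment allez-vous ?')]
```

Time-based alignment sidesteps length-ratio heuristics entirely: two segments are paired simply because they appear on screen at (nearly) the same moment, which is robust even when translations differ greatly in length.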