Changes between Version 11 and Version 12 of ParallelCzechNorwegian


Timestamp:
Jan 17, 2017, 9:17:14 PM (7 years ago)
Author:
xbaisa
Comment:

--

  • ParallelCzechNorwegian

    v11 v12  
    55The source for this corpus is the !OpenSubtitles corpus, made available within the [[http://opus.lingfil.uu.se/OpenSubtitles2016.php|OPUS2 parallel corpus]] [2] and originally taken from
    66[http://www.opensubtitles.org/].
     7Because many subtitles are very old, their encoding had to be determined automatically, the various formats converted into XML, and the text segmented into sentences and tokens. OCR errors (many subtitles are automatically extracted from video streams) were corrected using the noisy-channel principle with language models trained on Google N-grams data. Metadata (information about the films, the dates the subtitles were created, etc.) are not preserved in the final parallel corpus. Document-level alignment was done using a heuristic function; sentence (segment)-level alignment was done using the time overlaps of the subtitle intervals.
    78
    89
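
The segment-level alignment by time overlap described above can be illustrated with a minimal sketch. This is not the pipeline actually used to build the corpus: the function names, the `(start, end, text)` segment representation and the `min_ratio` threshold are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation of one subtitle segment:
# (start time in seconds, end time in seconds, text).
Segment = Tuple[float, float, str]


def overlap(a: Segment, b: Segment) -> float:
    """Return the length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align_by_time_overlap(
    src: List[Segment], tgt: List[Segment], min_ratio: float = 0.5
) -> List[Tuple[str, str]]:
    """Pair each source segment with the target segment it overlaps most.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    shorter of the two segments (an assumed threshold, not taken from the
    corpus documentation).
    """
    pairs = []
    for s in src:
        best, best_overlap = None, 0.0
        for t in tgt:
            current = overlap(s, t)
            if current > best_overlap:
                best, best_overlap = t, current
        if best is None:
            continue  # no target segment overlaps this source segment
        shorter = min(s[1] - s[0], best[1] - best[0])
        if shorter > 0 and best_overlap / shorter >= min_ratio:
            pairs.append((s[2], best[2]))
    return pairs
```

Given two subtitle files for the same film, each parsed into such a list of segments, `align_by_time_overlap(czech, norwegian)` would return the text pairs whose display intervals mostly coincide. Segments with no counterpart on the other side are dropped.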