| 7 | Because many subtitles are quite old, the character encoding must be detected automatically; the various source formats are converted into XML, and the text is segmented into sentences and tokens. OCR errors (many subtitles are extracted automatically from video streams) were corrected using the noisy-channel principle, with language models trained on the Google N-grams data. Metadata (information about the films, the date the subtitles were created, etc.) is not preserved in the final parallel corpus. Document-level alignment was done with a heuristic scoring function, while sentence (segment)-level alignment used the time overlap between subtitle intervals. |
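The interval-overlap idea behind the segment-level alignment can be sketched as follows. This is an illustrative minimal version, not the paper's exact algorithm: the function names, the Jaccard-style overlap score, the 0.5 threshold, and the greedy best-match strategy are all assumptions for the sketch.

```python
def overlap_ratio(a, b):
    """Temporal overlap of two (start, end) intervals in seconds,
    normalized by the length of their union (Jaccard-style score)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def align_by_time(src, tgt, threshold=0.5):
    """Pair source/target subtitle segments whose on-screen time spans
    overlap strongly enough.

    src, tgt: lists of (start_sec, end_sec, text) tuples, sorted by start.
    Returns a list of (src_text, tgt_text) pairs.

    Hypothetical sketch: a real pipeline would also correct global timing
    offsets between subtitle files and allow 1-to-many segment links.
    """
    pairs = []
    for s_start, s_end, s_text in src:
        best, best_r = None, threshold
        for t_start, t_end, t_text in tgt:
            r = overlap_ratio((s_start, s_end), (t_start, t_end))
            if r > best_r:
                best, best_r = t_text, r
        if best is not None:
            pairs.append((s_text, best))
    return pairs

src = [(0.0, 2.5, "Hello."), (3.0, 5.0, "How are you?")]
tgt = [(0.1, 2.4, "Bonjour."), (3.1, 5.2, "Comment allez-vous ?")]
print(align_by_time(src, tgt))
# → [('Hello.', 'Bonjour.'), ('How are you?', 'Comment allez-vous ?')]
```

Time-based alignment sidesteps length-ratio heuristics entirely: two segments are paired simply because they appear on screen at (nearly) the same moment, which is robust even when translations differ greatly in length.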