= D2.4: Parallel Czech-Norwegian corpus, size 10 million tokens =

== Source ==

The source for this corpus was taken from the !OpenSubtitles corpus, made available within the [[http://opus.lingfil.uu.se/OpenSubtitles2016.php|OPUS2 parallel corpus]] [2] and originally obtained from [http://www.opensubtitles.org/].

Because many subtitles are very old, their encoding had to be detected automatically, the various formats converted into XML, and the text segmented into sentences and tokens. OCR errors (many subtitles are automatically extracted from video streams) were corrected using the noisy-channel principle, with language models trained on Google N-gram data. Metadata (information about the films, subtitle creation dates, etc.) are not preserved in the final parallel corpus.

Document-level alignment was done using a heuristic function; sentence-level (segment-level) alignment was done using the time overlaps of subtitle intervals.

== Statistics ==

||=Part=||=Tokens=||=Words=||=Segments=||
||Czech|| 32,345,496|| 24,101,302|| 4,235,111||
||Norwegian|| 32,549,746|| 25,503,941|| 4,235,111||

== Examples ==

[[http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=czech_norwegian_opus__czech&reload=&iquery=Praha&queryselector=iqueryrow&lemma=&phrase=&word=&char=&cql=&default_attr=word&sel_aligned=czech_norwegian_opus__norwegian&pcq_pos_neg_czech_norwegian_opus__norwegian=pos&iquery_czech_norwegian_opus__norwegian=Praha&queryselector_czech_norwegian_opus__norwegian=iqueryrow&phrase_czech_norwegian_opus__norwegian=&word_czech_norwegian_opus__norwegian=&char_czech_norwegian_opus__norwegian=&cql_czech_norwegian_opus__norwegian=&filter_nonempty_czech_norwegian_opus__norwegian=on&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all|Parallel search for "Praha"]]
[[http://corpora.fi.muni.cz/habit/run.cgi/first?corpname=czech_norwegian_opus__czech&reload=&iquery=berla&queryselector=iqueryrow&lemma=&phrase=&word=&char=&cql=&default_attr=word&sel_aligned=czech_norwegian_opus__norwegian&pcq_pos_neg_czech_norwegian_opus__norwegian=pos&iquery_czech_norwegian_opus__norwegian=&queryselector_czech_norwegian_opus__norwegian=iqueryrow&phrase_czech_norwegian_opus__norwegian=&word_czech_norwegian_opus__norwegian=&char_czech_norwegian_opus__norwegian=&cql_czech_norwegian_opus__norwegian=&filter_nonempty_czech_norwegian_opus__norwegian=on&fc_lemword_window_type=both&fc_lemword_wsize=5&fc_lemword=&fc_lemword_type=all|Czech word "berla"]]
[[http://corpora.fi.muni.cz/habit/index.html|Access the corpus]]

=== Norwegian words with more than 100,000 occurrences ===

||er || 821,781||
||det || 589,721||
||du || 554,116||
||Jeg || 547,501||
||ikke || 506,186||
||jeg || 418,217||
||en || 360,871||
||i || 341,400||
||har || 315,050||
||Det || 310,092||
||på || 307,877||
||å || 296,603||
||og || 293,047||
||til || 271,992||
||deg || 259,043||
||meg || 245,155||
||med || 242,594||
||for || 213,835||
||Du || 211,802||
||at || 204,376||
||som || 203,379||
||vi || 171,073||
||var || 165,487||
||kan || 162,222||
||av || 160,980||
||om || 149,962||
||den || 148,767||
||vil || 147,605||
||så || 147,174||
||Vi || 145,267||
||et || 138,850||
||han || 126,251||
||skal || 119,570||
||Hva || 110,797||
||de || 110,202||
||Han || 107,929||
||må || 101,278||

=== Czech words with more than 100,000 occurrences ===

||to || 656,606||
||se || 560,332||
||je || 422,521||
||že || 345,153||
||na || 327,317||
||jsem || 309,133||
||a || 297,950||
||si || 231,641||
||v || 201,975||
||co || 172,431||
||To || 160,908||
||s || 152,526||
||A || 149,175||
||mi || 142,779||
||mě || 132,047||
||tak || 121,439||
||jsi || 118,647||
||do || 113,030||
||o || 112,856||
||Je || 106,979||

=== Example parallel segments ===

||=Norwegian=||=Czech=||
||Om jeg hadde $ 300, kunne jeg kommet meg til Tyskland.||Ne, ale kdybych měl 300$, dostal bych se do Německa.||
||Aldri i livet! ||Až naprší a uschne.||
||Jeg vil bli her... og fiske, slik Manuel gjorde.||Chci zůstat tady... a jezdit na ryby, jako Manuel.||
||Transilvania.||Transylvánie.||
||"Polka-Dot banditten og gjengen beskyldt for å utføre røveriet"||"Podezření padá na banditu Polka-Dot ."||
||Fortsette som før?||Jako dřív?||
||Nå har vi rikelig med sol for smilefjeset.||Teď svítí sluníčko pro pana Šťastného.||
||Det minner meg om de ødelagte forsvarsverker på mitt eget slått i Transilvania.||Připomíná mi to zchátralé cimbuří mého vlastního hradu v Transylvánii.||
||Ikke minn meg på det.||Nepřipomínej mi to.||
||Følge etter?||- Sledovat?||

== Corpus query interface ==

The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. The corpus can be searched at http://corpora.fi.muni.cz/habit/.

== References ==

 - [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
 - [2] -- Tiedemann, Jörg. "News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces." In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov (eds.), Recent Advances in Natural Language Processing (vol. V), pages 237-248. John Benjamins, Amsterdam/Philadelphia, 2009.
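
== Appendix: sketch of time-overlap alignment ==

The Source section states that segment-level alignment relies on the time overlaps of subtitle intervals. The Python sketch below illustrates that general idea only; it is not the procedure actually used to build this corpus. The `Subtitle` type, the `align_by_time` function, the `min_ratio` threshold, and all timestamps are hypothetical. It assumes both subtitle lists are sorted by start time and that subtitles within one file do not overlap each other.

```python
from typing import List, NamedTuple, Tuple


class Subtitle(NamedTuple):
    start: float  # display start time in seconds
    end: float    # display end time in seconds
    text: str


def overlap(a: Subtitle, b: Subtitle) -> float:
    """Length in seconds of the intersection of the two display intervals."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src: List[Subtitle], tgt: List[Subtitle],
                  min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each source subtitle with the target subtitle it overlaps most.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    source subtitle's duration (an assumed, illustrative threshold).
    """
    pairs = []
    j = 0  # index of the first target subtitle that may still overlap
    for s in src:
        # Target subtitles that ended before s starts cannot overlap s
        # or any later source subtitle, so skip them for good.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best, best_ov = None, 0.0
        k = j
        # Scan the target subtitles that start before s ends.
        while k < len(tgt) and tgt[k].start < s.end:
            ov = overlap(s, tgt[k])
            if ov > best_ov:
                best, best_ov = tgt[k], ov
            k += 1
        duration = s.end - s.start
        if best is not None and duration > 0 and best_ov / duration >= min_ratio:
            pairs.append((s.text, best.text))
    return pairs


# Texts come from the example table above; the timestamps are invented.
src = [
    Subtitle(1.0, 3.0, "Transilvania."),
    Subtitle(4.0, 6.0, "Ikke minn meg på det."),
    Subtitle(10.0, 11.0, "Følge etter?"),
]
tgt = [
    Subtitle(1.2, 3.1, "Transylvánie."),
    Subtitle(4.1, 5.9, "Nepřipomínej mi to."),
    Subtitle(20.0, 21.0, "- Sledovat?"),
]
pairs = align_by_time(src, tgt)
```

In this toy input the first two subtitles are paired, while the third is left unaligned because its invented target timing does not overlap the source interval at all. A real pipeline would also need to handle one-to-many matches and timing drift between subtitle files, which this sketch ignores.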