D2.4: Parallel Czech-Norwegian corpus, size 10 million tokens

Source

The corpus is derived from the OpenSubtitles corpus distributed within the OPUS2 parallel corpus collection [2]; the subtitles originally come from http://www.opensubtitles.org/. Because many subtitles are very old, their encoding had to be detected automatically, the various subtitle formats were converted into XML, and the text was segmented into sentences and tokens. OCR errors (many subtitles are extracted automatically from video streams) were corrected using the noisy-channel principle with language models trained on Google N-grams data. Metadata (information about the films, the date the subtitles were created, etc.) is not preserved in the final parallel corpus. Document-level alignment was done using a heuristic function; sentence-level (segment-level) alignment was done using the time overlaps of subtitle intervals.
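The time-overlap alignment mentioned above can be illustrated with a minimal sketch. This is not the pipeline actually used for the corpus: the `Subtitle` class, the `align_by_time` function, and the `min_ratio` threshold are hypothetical names and values chosen for illustration, and the sketch assumes that both subtitle lists are sorted by start time and that subtitles within one list do not overlap each other.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # display start time in seconds
    end: float    # display end time in seconds
    text: str


def overlap(a: Subtitle, b: Subtitle) -> float:
    """Length in seconds of the intersection of two display intervals."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(src, tgt, min_ratio=0.5):
    """Pair each source subtitle with the target subtitle it overlaps most.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    shorter of the two intervals (hypothetical threshold).
    """
    pairs = []
    j = 0
    for s in src:
        # Skip target subtitles that end before the source subtitle starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best, best_ov = None, 0.0
        k = j
        # Scan target subtitles that start before the source subtitle ends.
        while k < len(tgt) and tgt[k].start < s.end:
            ov = overlap(s, tgt[k])
            if ov > best_ov:
                best, best_ov = tgt[k], ov
            k += 1
        if best is not None:
            shorter = min(s.end - s.start, best.end - best.start)
            if shorter > 0 and best_ov / shorter >= min_ratio:
                pairs.append((s.text, best.text))
    return pairs


# Usage with invented timestamps (the texts are taken from the example segments).
norwegian = [Subtitle(1.0, 3.0, "Følge etter?"),
             Subtitle(5.0, 7.0, "Ikke minn meg på det.")]
czech = [Subtitle(1.2, 3.1, "- Sledovat?"),
         Subtitle(5.1, 6.8, "Nepřipomínej mi to.")]
pairs = align_by_time(norwegian, czech)
```

A one-to-one pairing like this is the simplest case; a real aligner would also need to merge several subtitles on one side into a single segment.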

Statistics

Part       Tokens      Words       Segments
Czech      32,345,496  24,101,302  4,235,111
Norwegian  32,549,746  25,503,941  4,235,111

Examples

Parallel search for "Praha"

Czech word "berla"

Access the corpus

Norwegian words with more than 100,000 occurrences

er 821,781
det 589,721
du 554,116
Jeg 547,501
ikke 506,186
jeg 418,217
en 360,871
i 341,400
har 315,050
Det 310,092
307,877
å 296,603
og 293,047
til 271,992
deg 259,043
meg 245,155
med 242,594
for 213,835
Du 211,802
at 204,376
som 203,379
vi 171,073
var 165,487
kan 162,222
av 160,980
om 149,962
den 148,767
vil 147,605
147,174
Vi 145,267
et 138,850
han 126,251
skal 119,570
Hva 110,797
de 110,202
Han 107,929
101,278

Czech words with more than 100,000 occurrences

to 656,606
se 560,332
je 422,521
že 345,153
na 327,317
jsem 309,133
a 297,950
si 231,641
v 201,975
co 172,431
To 160,908
s 152,526
A 149,175
mi 142,779
132,047
tak 121,439
jsi 118,647
do 113,030
o 112,856
Je 106,979
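
The two lists above are case-sensitive (for example, "Jeg" and "jeg" are counted separately). A minimal sketch of how such a list could be produced is shown below; it assumes the corpus text is available as lines of whitespace-separated tokens, and the function name `frequent_tokens` is hypothetical. It is not the procedure used to generate the lists on this page.

```python
from collections import Counter


def frequent_tokens(lines, threshold):
    """Return (token, count) pairs with count > threshold, most frequent first.

    Counting is case-sensitive, so "Jeg" and "jeg" are distinct tokens.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [(tok, n) for tok, n in counts.most_common() if n > threshold]


# Toy input; for the real lists the threshold would be 100000.
sample = ["Jeg vil bli her", "jeg vil", "vil"]
result = frequent_tokens(sample, 1)
```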

Example parallel segments

Norwegian | Czech
Om jeg hadde $ 300, kunne jeg kommet meg til Tyskland. | Ne, ale kdybych měl 300$, dostal bych se do Německa.
Aldri i livet! | Až naprší a uschne.
Jeg vil bli her... og fiske, slik Manuel gjorde. | Chci zůstat tady... a jezdit na ryby, jako Manuel.
Transilvania. | Transylvánie.
"Polka-Dot banditten og gjengen beskyldt for å utføre røveriet" | "Podezření padá na banditu Polka-Dot ."
Fortsette som før? | Jako dřív?
Nå har vi rikelig med sol for smilefjeset. | Teď svítí sluníčko pro pana Šťastného.
Det minner meg om de ødelagte forsvarsverker på mitt eget slått i Transilvania. | Připomíná mi to zchátralé cimbuří mého vlastního hradu v Transylvánii.
Ikke minn meg på det. | Nepřipomínej mi to.
Følge etter? | - Sledovat?

Corpus query interface

The corpus has been indexed by the Sketch Engine corpus manager and query system [1]. It can be searched at http://corpora.fi.muni.cz/habit/.

References

  • [1] -- Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. "The Sketch Engine: ten years on." Lexicography 1, no. 1 (2014): 7-36.
  • [2] -- Tiedemann, Jörg. "News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces." In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov (eds.), Recent Advances in Natural Language Processing (vol. V), pages 237-248. John Benjamins, Amsterdam/Philadelphia, 2009.
Last modified on Jan 17, 2017, 9:17:14 PM