D4.4a: An improved definition of Word Sketches for Czech

This report describes the new sketch grammar for Czech created within the scope of the project.

State of the art

Unlike the other languages in the project, Czech already had a developed sketch grammar before the project started. However, that grammar had two significant problems:

  1. the grammatical relation names were hard to read for anyone not familiar with the sketch grammar in detail, which led to frequent confusion and misunderstandings among users
  2. there were highly visible errors in the word sketches, partly due to systematic tagging errors, but partly also because the sketch grammar definitions did not filter these errors out, although they could have removed a large portion of them

New sketch grammar

In the new sketch grammar for Czech, we addressed both of these issues. We renamed the relations according to the template used in the English and Spanish grammars within the Sketch Engine (so far the two most developed grammars), and we carefully revised all the rules, with special attention to those producing frequent errors. We also rationalised the definitions, again following the practice of the English and Spanish grammars, so that they are more readable and easier to modify in the future. In addition, we added a mapping from the relation names to template names, which makes it possible to match the new Czech word sketches to word sketches in other languages in the Bilingual word sketch application. The definitions of the relations are based on the Majka part-of-speech tagset, which is used by the tagger integrated into Sketch Engine.
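The role of the template mapping can be illustrated with a minimal sketch. It assumes that each grammar simply maps its own relation names to shared template names; all relation and template names below are hypothetical placeholders, not the names used in the actual grammars, and the Bilingual word sketch application may align relations differently.

```python
from typing import Dict, List, Tuple

# Hypothetical relation-to-template mappings for two grammars.
# The names are placeholders invented for this illustration.
CZECH_TEMPLATES: Dict[str, str] = {
    "has_subj_cs": "SUBJECT",
    "has_obj4_cs": "OBJECT",
    "modifier_cs": "MODIFIER",
}
ENGLISH_TEMPLATES: Dict[str, str] = {
    "subject_en": "SUBJECT",
    "object_en": "OBJECT",
    "and_or_en": "COORDINATION",
}


def align_relations(
    left: Dict[str, str], right: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Pair relation names from two grammars that share a template name."""
    # Index the right-hand grammar by template for quick lookup.
    by_template: Dict[str, List[str]] = {}
    for name, template in right.items():
        by_template.setdefault(template, []).append(name)

    pairs = []
    for name, template in left.items():
        for other in by_template.get(template, []):
            pairs.append((name, other, template))
    return sorted(pairs)


aligned = align_relations(CZECH_TEMPLATES, ENGLISH_TEMPLATES)
```

Relations without a counterpart in the other grammar (here the placeholder `modifier_cs` and `and_or_en`) are simply left unpaired, which is one plausible way to handle relations that exist in only one language.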

In total, there are 33 relations in the current version of the grammar, covering the most important grammatical phenomena, such as modifiers of all parts of speech, subjects, objects, predicates and coordinations. A few relations, e.g. noun is ... or subject of be adj, were added to match the templates defined by the English and Spanish grammars.

The examples below illustrate the most important differences between the old version of the Czech word sketches and the new one.

Figure 1: Old word sketch for verb vidět (to see). Almost all items in the relation has_subj are in fact objects. The wrong items appear because of the nominative-accusative homonymy in the Czech language and the systematic tagging errors that result from it. However, many of these errors could be eliminated by adjusting the sketch grammar rules for the subject relation.

Figure 2: New word sketch for verb vidět (to see), created within the scope of the project.

Comparison of Figures 1 and 2 shows the improvement:

  • almost all items in the subject relation were wrong in the old version; in the current version, almost all the displayed items are correct
  • the names of the grammatical relations are much more readable, even for people who do not know the grammar in detail
  • prepositional phrases are organised into one meta-relation (or TRINARY relation), prepositional phrases, according to the templates from the English and Spanish grammars, which provides a better summary of how the word combines with various prepositions
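To make the subject-relation improvement more concrete, the following sketch shows one kind of check that can reduce object-as-subject errors. It is only an illustration under stated assumptions: the tag strings follow the attribute-value style of the Majka tagset as we understand it (k = part of speech, c = case, n = number), the function names are hypothetical, and the number-agreement heuristic is an assumed example, not the rule actually used in the new grammar.

```python
import re
from typing import Dict


def parse_tag(tag: str) -> Dict[str, str]:
    """Split an attribute-value tag into a dict of attribute -> value.

    Assumes each attribute is one lowercase letter followed by a
    one-character value (uppercase letter or digit).
    """
    return dict(re.findall(r"([a-z])([A-Z0-9])", tag))


def plausible_subject(noun_tag: str, verb_tag: str) -> bool:
    """Assumed heuristic: accept a noun as a subject candidate only if it
    is tagged nominative (c1) and agrees in number with the verb."""
    noun, verb = parse_tag(noun_tag), parse_tag(verb_tag)
    return (
        noun.get("k") == "1"              # noun
        and verb.get("k") == "5"          # verb
        and noun.get("c") == "1"          # nominative case
        and noun.get("n") is not None
        and noun.get("n") == verb.get("n")  # number agreement
    )


# Illustrative tags: a singular verb with three candidate nouns.
verb_sg = "k5eAaImIp3nS"
checks = {
    "nominative singular": plausible_subject("k1gInSc1", verb_sg),
    "nominative plural": plausible_subject("k1gInPc1", verb_sg),
    "accusative singular": plausible_subject("k1gInSc4", verb_sg),
}
```

A check of this kind cannot catch every case, since an accusative form mistagged as nominative with the same number as the verb still passes, but it shows how a grammar rule can filter out part of the systematic tagging errors.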

Another example in Figure 3 shows the new Czech word sketch relations for a noun.

Figure 3: New word sketch for noun auto (car)

Last modified on May 31, 2017, 12:59:08 AM
