
D4.3: Visualization tool for Sketch Grammar queries

This tool is aimed at experienced word sketch users who want to understand precisely how collocations were extracted from the corpus.

Building Word Sketches

The word sketch approach combines statistics with a sketch grammar: a set of manually defined syntactic rules that limit what counts as a co-occurrence of two particular words. The core of a sketch grammar is a set of queries in the corpus query language (CQL), each with marked positions for the two words that can form a collocation. Only word pairs matching one of the queries are then considered co-occurrences in the statistical computations. Each CQL query is associated with a label that describes the grammatical relationship between the two words. One grammatical label (or relation) can be assigned to multiple queries.
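The idea of letting a rule decide what counts as a co-occurrence can be illustrated with a minimal Python sketch. It is not Sketch Engine code: the toy sentence, the tag names and the helper adjacent_pairs are all invented for illustration, and the rule only handles two adjacent tokens, roughly what a query such as 2:[tag="AJ.*"] 1:[tag="NN.*"] would express.

```python
import re
from collections import Counter

# Toy tagged sentence as (word, tag) pairs; words and tags are invented for illustration.
SENTENCE = [
    ("the", "DT"), ("red", "AJ"), ("house", "NN"), ("stood", "VB"),
    ("near", "PR"), ("old", "AJ"), ("trees", "NN"),
]

def adjacent_pairs(tokens, collocate_tag, headword_tag, relation):
    """Yield (relation, headword, collocate) for each adjacent token pair
    whose tags match the two regular expressions, in that order."""
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if re.fullmatch(collocate_tag, t1) and re.fullmatch(headword_tag, t2):
            yield (relation, w2, w1)

# Only pairs licensed by the rule are counted; "stood near" is ignored.
counts = Counter(adjacent_pairs(SENTENCE, "AJ.*", "NN.*", "modifier"))
for triple, n in counts.items():
    print(triple, n)
```

The resulting counts per (relation, headword, collocate) triple are the kind of input that the statistical computations would then work on.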

Macros in the m4 format can be used in the sketch grammar so that the developer does not have to repeat potentially complex queries containing regular expressions and can instead give each such query a label, e.g. noun for [tag="N.*"].
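The effect of such macros can be sketched in Python as plain textual substitution. This is a simplified stand-in, not m4 itself: the macro table, the adjective entry and the function expand_macros are assumptions made for illustration, and only the noun entry comes from the example in the text.

```python
import re

# Hypothetical macro table: label -> CQL fragment.
MACROS = {
    "noun": '[tag="N.*"]',        # the example given in the text
    "adjective": '[tag="A.*"]',   # assumed, for illustration only
}

def expand_macros(query, macros):
    """Replace every whole-word macro name in the query with its CQL fragment."""
    if not macros:
        return query
    names = "|".join(re.escape(name) for name in macros)
    pattern = re.compile(r"\b(" + names + r")\b")
    # Replaced text is not rescanned, so a fragment cannot trigger another macro.
    return pattern.sub(lambda m: macros[m.group(1)], query)

expanded = expand_macros("2:adjective 1:noun", MACROS)
print(expanded)
```

Real m4 offers far more (arguments, nested expansion, quoting rules); the sketch only shows why a short label keeps the grammar readable.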

In addition, a few so-called processing directives modify how the queries are evaluated. For example, the *DUAL directive defines two complementary relations (e.g. "modifier of X" vs. "words modified by X") with a single CQL query; the query is evaluated only once, which makes creating the two relations more efficient. Another option is to include a third word in the relationship, more precisely in the grammatical relation name. This is done with the *TRINARY directive: a third word can be labelled "3:" within the CQL query, and this word replaces the string "%s" in the relation name, potentially creating a large number of different grammatical relations.
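What the two directives do to a single query match can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the functions dual_relations and trinary_relation are hypothetical, and so are the convention of writing the two relation names of a dual relation as one string separated by a slash and the example name pp_%s.

```python
def dual_relations(name, first, second):
    """From one match, produce two mirrored relation records.

    `name` is assumed to have the form "rel_a/rel_b".
    """
    rel_a, rel_b = name.split("/", 1)
    return [(rel_a, first, second), (rel_b, second, first)]

def trinary_relation(template, first, second, third):
    """Insert the third matched word into the relation name in place of "%s"."""
    return (template.replace("%s", third), first, second)

# One match of an adjective-noun query yields both directions at once.
print(dual_relations("modifier/modifies", "house", "red"))

# Each distinct third word yields a distinct relation name.
print(trinary_relation("pp_%s", "look", "window", "through"))
print(trinary_relation("pp_%s", "look", "camera", "into"))
```

The last two calls show why a trinary relation can multiply into many grammatical relations: the set of relation names grows with the number of different words found in the third position.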

During sketch grammar development, the developer needs the relation names, macros and directives to be clearly marked so that they do not have to read every piece of the code. Likewise, an expert user who wants to reconstruct how a grammatical relation is built up will find the CQL definition easier to read and understand if units with different meanings are marked differently.

Visualization tool

We have therefore improved the syntax highlighting that was available within the Sketch Engine tool. The sketch grammar of a particular corpus can be displayed by clicking on one of the grammatical relation names, which takes the user to a colour-coded HTML view of the sketch grammar.

As shown in Figure 1, macros and directives are displayed in blue; grammatical relation names are green, as are the labels of the tokens/words that enter the relation. Comments are dark red, and everything else (uses of the macros as well as ordinary CQL queries) is black. The sketch grammar is also formatted into paragraphs in HTML so that its structure is clear.
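The colour scheme can be approximated by a small line-based highlighter. This is a hedged sketch rather than the code behind the tool: the function highlight_line is hypothetical, and it assumes that directives start with *, relation names with =, comments with # and macro definitions with define(, which are assumptions about the grammar file layout, not statements from this page.

```python
import html
import re

GREEN = '<span style="color:green">'

def highlight_line(line):
    """Wrap one sketch grammar line in HTML colour spans (assumed line prefixes)."""
    stripped = line.strip()
    escaped = html.escape(line, quote=False)  # keep quotes readable, escape <, >, &
    if stripped.startswith("#"):              # comment
        return f'<span style="color:darkred">{escaped}</span>'
    if stripped.startswith("*") or stripped.startswith("define("):
        return f'<span style="color:blue">{escaped}</span>'   # directive or macro definition
    if stripped.startswith("="):              # grammatical relation name
        return f"{GREEN}{escaped}</span>"
    # Ordinary query line: stays black, only the position labels 1:, 2:, 3: turn green.
    return re.sub(r"(?<!\w)([123]:)", GREEN + r"\1</span>", escaped)

grammar = ["# adjectival modifiers", "*DUAL", "=modifier/modifies", '2:"A.*" 1:"N.*"']
for grammar_line in grammar:
    print(highlight_line(grammar_line))
```

A line-based approach is enough here because each kind of unit is assumed to occupy its own line; a grammar that mixes units on one line would need a real tokenizer.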

Figure 1. Screenshot from the visualization tool for sketch grammar queries -- Amharic sketch grammar.

On top of this functionality, it is also possible to find out how many hits a particular query produced, as well as any errors that occurred while the queries were evaluated. All of this is logged, and the logs are accessible within the Corpus Architect interface to users who are allowed to view them. Together with the visualization tool described above, this functionality provides a convenient environment for sketch grammar developers as well as for expert users who want to know how the system works internally.

Last modified on May 31, 2017, 3:44:08 PM
