| 1 | = D4.3: Visualization tool for Sketch Grammar queries = |
| 2 | |
| 3 | This tool is aimed at experienced users of the word sketches who want to understand precisely how the collocates were extracted from the corpus. |
| 4 | |
| 5 | = Building Word Sketches = |
| 6 | |
| 7 | The word sketch approach combines co-occurrence statistics |
| 8 | with a set of manually defined syntactic rules, called a sketch grammar, that limit what counts |
| 9 | as a co-occurrence of two particular words. The core |
| 10 | of a sketch grammar is a set of queries in the Corpus Query Language (CQL), |
| 11 | each marking the positions of the two words that can form a collocation. |
| 12 | Only word pairs matching one of the queries are then counted as co-occurrences |
| 13 | in the statistical computations. Each CQL query is associated |
| 14 | with a label that describes the grammatical relationship between the |
| 15 | two words. One grammatical label (or relation) can be assigned to |
| 16 | multiple queries. |
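
The idea of counting only rule-matched pairs can be illustrated with a minimal Python sketch. It is not how CQL is evaluated: each "rule" here is just a pair of regular expressions over the tags of two adjacent tokens, and the tag set, relation labels and example sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical tagged sentence as (word, tag) pairs; the tags are invented for this example.
SENTENCE = [("the", "DT"), ("red", "AJ"), ("car", "NN"), ("stopped", "VB"), ("quickly", "AV")]

# Each rule is (relation label, tag pattern for the first token, tag pattern for the second).
# This is a simplified stand-in for a CQL query with two marked positions;
# only directly adjacent tokens are considered.
RULES = [
    ("modifier", r"AJ.*", r"N.*"),
    ("subject_of", r"N.*", r"V.*"),
]

def extract_pairs(tokens, rules):
    """Count (label, word1, word2) triples for adjacent tokens that match a rule."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        for label, p1, p2 in rules:
            if re.fullmatch(p1, t1) and re.fullmatch(p2, t2):
                counts[(label, w1, w2)] += 1
    return counts

pairs = extract_pairs(SENTENCE, RULES)
```

Adjacent pairs that match no rule, such as "the red" or "stopped quickly", never reach the counts, so they would not enter any later statistical computation.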
| 17 | |
| 18 | Macros in the {{{m4}}} format can be used in the sketch grammar, so that |
| 19 | the developer does not have to repeat potentially complex queries containing regular |
| 20 | expressions and can refer to each such query by a name, e.g. ''noun'' for {{{[tag="N.*"]}}}. |
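
The effect of such a macro can be approximated in a few lines of Python. This is plain whole-word text substitution, not the m4 processor itself. Only the ''noun'' macro comes from the text above; the ''adjective'' macro and the "1:"/"2:" position labels in the example query are assumptions made for illustration.

```python
import re

# Hypothetical macro table; only "noun" is taken from the text, "adjective" is invented.
MACROS = {
    "noun": '[tag="N.*"]',
    "adjective": '[tag="AJ.*"]',
}

def expand(query, macros):
    """Replace every whole-word macro name in the query with its definition (single pass)."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, macros)) + r")\b")
    return pattern.sub(lambda m: macros[m.group(1)], query)

expanded = expand("2:adjective 1:noun", MACROS)
# expanded == '2:[tag="AJ.*"] 1:[tag="N.*"]'
```

The query that uses the macros stays short and readable, while the expanded form is what would actually be evaluated.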
| 21 | |
| 22 | In addition, a few so-called processing directives can be used to modify how the queries are evaluated. |
| 23 | For example, the *DUAL directive allows two complementary relations (e.g. "modifier of X" vs. "words modified by X") to be defined by a single CQL query. The query is evaluated only once, which makes creating the two relations more efficient. Another possibility is to include a third word in the relationship, or more |
| 24 | precisely in the name of the grammatical relation. This is done with the *TRINARY |
| 25 | directive: a third word can be labelled with “3:” within the CQL queries, |
| 26 | and this word replaces the string “%s” in the relation name, potentially creating |
| 27 | a large number of different grammatical relations. |
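
The following Python sketch models the two directives on already extracted matches. It only illustrates the behaviour described above and is not the Sketch Engine implementation; the function names, relation names and example words are hypothetical.

```python
from collections import defaultdict

def apply_dual(name_a, name_b, matches):
    """Fill two mirrored relations in one pass over the matches of a single query."""
    relations = defaultdict(list)
    for w1, w2 in matches:
        relations[name_a].append((w1, w2))  # w1 as headword, w2 as collocate
        relations[name_b].append((w2, w1))  # the same match seen from the other word
    return dict(relations)

def apply_trinary(name_template, matches):
    """Build one relation per distinct third word by substituting it for %s in the name."""
    relations = defaultdict(list)
    for w1, w2, w3 in matches:
        relations[name_template.replace("%s", w3)].append((w1, w2))
    return dict(relations)

dual = apply_dual("modifier", "modifies", [("car", "red"), ("house", "old")])
trinary = apply_trinary(
    "pp_%s",
    [("look", "window", "through"), ("look", "picture", "at"), ("stare", "wall", "at")],
)
```

In this sketch `dual` holds two relations built from a single list of matches, and `trinary` holds one relation per distinct third word (`pp_through` and `pp_at`), which shows how a single query can produce many relation names.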
| 28 | |
| 29 | During sketch grammar development, the developer needs the relation names, macros and directives to be clearly marked, so that they do not have to read every piece of the code. Clear marking also makes the CQL definitions easier to read and understand for an expert user who wants to reconstruct how a grammatical relation is built up. |
| 30 | |
| 31 | = Visualization tool = |
| 32 | |
| 33 | We have therefore improved the syntax highlighting that was available within the Sketch Engine tool. The sketch grammar of a particular corpus can be displayed by clicking on one of the grammatical relation names, which takes the user to a colour-highlighted HTML view of the sketch grammar. |
| 34 | |
| 35 | As shown in Figure 1, the macros and directives are shown in blue, and the grammatical relation names and the labels of the tokens/words that enter the relation are shown in green. Comments are in dark red, and everything else (uses of the macros as well as ordinary CQL queries) is in black. The sketch grammar is also formatted into paragraphs in HTML so that its structure is clear. |
| 36 | |
| 37 | [[Image(wsdef.png)]] |
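
A highlighter following the colour scheme described above could look roughly like the Python sketch below. It is not the code used in Sketch Engine. The line conventions it relies on are assumptions: comments start with `#`, directives with `*`, relation names with `=`, macro definitions with `define(`, and token labels have the form `1:`, `2:` or `3:`. The sample grammar is likewise only illustrative.

```python
import html
import re

# Colours follow the scheme described in the text.
COLOURS = {"comment": "darkred", "directive": "blue", "macro": "blue",
           "relation": "green", "query": "black"}
LABEL = re.compile(r"(\b[123]:)")  # assumed form of the token labels

def classify(line):
    """Guess the kind of a grammar line from its first characters (assumed conventions)."""
    s = line.strip()
    if s.startswith("#"):
        return "comment"
    if s.startswith("*"):
        return "directive"
    if s.startswith("define("):
        return "macro"
    if s.startswith("="):
        return "relation"
    return "query"

def render_line(line):
    """Wrap one line in a coloured span; token labels inside queries become green."""
    kind = classify(line)
    if kind == "query":
        parts = LABEL.split(line)  # the capturing group keeps the labels in the result
        body = "".join(
            f'<span style="color:green">{html.escape(p)}</span>' if LABEL.fullmatch(p)
            else html.escape(p)
            for p in parts
        )
    else:
        body = html.escape(line)
    return f'<span style="color:{COLOURS[kind]}">{body}</span>'

def highlight(grammar):
    """Turn each blank-line-separated block of the grammar into one HTML paragraph."""
    paragraphs = []
    for block in grammar.strip().split("\n\n"):
        lines = [render_line(line) for line in block.splitlines()]
        paragraphs.append("<p>" + "<br>\n".join(lines) + "</p>")
    return "\n".join(paragraphs)

# Illustrative grammar fragment, not taken from a real corpus configuration.
SAMPLE = """\
# macro for nouns
define(`noun', `[tag="N.*"]')

*DUAL
=modifier/modifies
2:[tag="AJ.*"] 1:noun
"""

page = highlight(SAMPLE)
```

Classifying whole lines keeps the sketch short. A real highlighter would more likely need a proper tokenizer, for example to handle comments that follow a query on the same line.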
| 38 | |
| 39 | On top of this functionality, it is also possible to find out how many hits a particular query had and whether any errors occurred during the evaluation of the queries. All of this is logged, and the logs are accessible to users within the Corpus Architect interface (provided they are allowed to view them). Together with the visualization tool described above, this functionality provides a convenient environment for the sketch grammar developer, as well as for expert users who want to know how the system works inside. |
| 40 | |