Changes between Version 1 and Version 2 of WordSketchGrammars


Ignore:
Timestamp:
Jan 15, 2016, 9:51:20 AM (9 years ago)
Author:
hales
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • WordSketchGrammars

    v1 v2  
    11= D1.1.3 Specification of word-sketch grammars and tools =
     2== The concept of a Word Sketch == #docs-internal-guid-33ee83b6-447a-e7a2-1eff-7d803eb0fa06
     3A word sketch is a “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and is a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). Word sketches have been introduced as a practical concept in computer lexicography to help lexicographers learn about how words behave in particular grammatical relations. They give a quick response to basic questions like
     4
     5[[BR]]''What are the most salient subjects of a given verb?''
     6
     7''What are the most salient modifiers of a given noun?''
     8
     9''What are the most salient verbs used with a given noun as object?''
     10
     11[[BR]]The key properties of a word sketch, as given in the definition are:
     12
     13[[BR]]
     14 * they are data-driven, i.e. corpus-based, with links back to the underlying corpus evidence, i.e. allowing inspection of particular concordance lines which were the source for the computation
     15
     16
     17 * they are produced in a fully automatic way, based on a hybrid rule-based and statistical approach (as explained further)
     18
     19
     20 * they are clustered into meaningful language- and task-dependent relations, usually corresponding to the syntactic functions of the collocates
     21
     22
     23 * they are easy to read, in a tabular format, with every column representing one grammatical relation (see Figure 1)[[BR]]
     24
     25
     26
     27Figure 1: Example word sketch table from the SkELL
     28
     29[[BR]]From a syntax theory point of view, word sketches represent an instance of the dependency syntax formalism allowing ambiguity, i.e. forming a dependency multigraph instead of a dependency tree. Similarly to a classical dependency relation, a word sketch relation is a binary relation, which:
     30
     31[[BR]]
     32 * might be reflexive (reflexive relations are called “constructions”)
     33
     34
     35 * might be symmetric
     36
     37
     38 * is not transitive
     39
     40
     41 * is labeled (i.e. named)
     42
     43
     44 * is scored
     45
     46
     47
     48[[BR]]Given that the relation is labeled, items of the relation are usually referred to as word sketch triples consisting of a headword, a relation name and a collocation, such as (resource; modifier; scarce). Such a triple is also assigned a score (a floating point number) representing the strength of the association between headword and collocation where particular semantics depends on the association measure that has been used to compute the score, as explained later.
     49
     50[[BR]]From a more technical perspective, we actually deal with word sketch quintuples because as we mentioned earlier, one wants to keep links to particular corpus positions where a word sketch occurs, and hence the quintuples contain position of the occurrence of the headword and collocation in the corpus (as integer numbers) in addition to the triples. Therefore, instead of a single item such as (resource; modifier; scarce), one gets a set of quintuples (resource; modifier; scarce; pos1; pos2) corresponding to individual occurrences of these words in the given grammatical relation in the corpus.
     51
     52== Word Sketch Computation, Indexing and Scoring ==
     53The word sketch formalism has been fully implemented within the Manatee corpus management system (Rychlý, 2000; Rychlý, 2007). The implementation consists of three loosely coupled modules: the first one responsible for computing of word sketch triples, the second for efficient indexing that enables fast search and retrieval of word sketch items, and the third for scoring of indexed word sketches.
     54
     55[[BR]]It is worth pointing out that particularly the module for word sketch computation is entirely independent of the module for word sketch indexing, allowing indexation of word sketch triples originating from various tools (not only from the module as provided in Manatee, but also e.g. from a common dependency parser).
     56
     57[[BR]]In further subsections we describe each of these modules in detail.
     58
     59== Word Sketch Computation ==
     60The word sketch computation module in Manatee is based on the concept of a word sketch grammar, a plain text file specifying multiple word sketch relations as regular expressions over corpus positions, usually exploiting morphological annotation if the corpus. Definitions of word sketch relations are actually standard corpus queries written in the Corpus Query Language (Jakubíček, 2010) which is used in Manatee. An example of a simple word sketch relation definition is given below:
     61
     62[[BR]]|| =modifier[[BR]][[BR]] 2:[tag=”ADJ”] 1:[tag=”NOUN”][[BR]][[BR]] 2:[word=”.*ing”] 1:[tag=”NOUN”] ||
     63
     64[[BR]]In this example, the modifier relation is defined using two CQL expressions (combined as logical OR) as any adjectives or -ing forms preceding a noun. In the CQL expression, the headword is always prefixed with a “1:” label and the collocation with a “2:” label. In the word sketch computation module, queries for all relations are executed over an already indexed corpus in Manatee, finally yielding the word sketch quintuples which are then subject to indexation.
     65
     66[[BR]]A word sketch grammar is a set of word sketch relations, each of which may be defined using multiple corpus queries. The sketch grammar formalism is very flexible and features a number of metagrammatical concepts easing the handcrafting and maintenance of the grammar, such as use of M4 macros (Kernighan and Ritchie, 1977) and a large number of directives simplifying the query specification. Hereby we list some of the most important directives:
     67
     68 * STRUCTLIMIT - if set to a corpus structure such as paragraph or sentence, all query matching following this directive throughout the whole grammar will stop at structure boundaries.
     69
     70
     71 * DEFAULTATTR - specifies default corpus attribute for all following queries in the sketch grammar.
     72
     73
     74 * DUAL - the name of the relation following this directive is split on a slash character, where first part of the name corresponds to the name of the relation as given, and the second part names a dual relation with headword and collocation interchanged.
     75
     76
     77 * SYMMETRIC - assumes that the relation following this directive is symmetric. In other words, SYMMETRIC is an instance of DUAL where both relations have the same name.
     78
     79
     80 * COLLOC - this directive allows forming the collocation string from multiple query parts (labeled with numbers as seen in the example.
     81
     82
     83
     84[[BR]]The following example sketch grammar facilitates all the directives listed above:
     85
     86[[BR]]|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]]  2:"ADJ"  "AD[JV]"{0,3} 1:"N"[[BR]][[BR]]  1:"N"  "ADJ"? 2:"ADJ"[[BR]][[BR]]  1:"V.*" 2:"ADV"[[BR]][[BR]]  2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]]  1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]]  1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]]  1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]]  1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]]  1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||
     87
     88[[BR]][[BR]]A full documentation to the sketch grammar formalism is available online.
     89
     90[[BR]]So far sketch grammars for 26 languages have been devised during Sketch Engine development.
     91
     92[[BR]]
     93=== Alternative Approaches to Word Sketch Computation ===
     94As we announced earlier, the word sketch generation module in Manatee can be easily replaced by any other tool for generating dependency triples. In particular, one can easily exploit any existing dependency parser for generating word sketches, as we have shown in (Kilgarriff, 2014). Moreover, since word sketches do not require unambiguous representation of a text/sentence, they can be generated also from dependency graphs as was demonstrated in (Horák, 2009).
     95
     96[[BR]]Usually converting the output of a dependency parser to the one necessary for word sketch indexing is very straightforward, and Manatee can directly do this task for the CoNLL dependency format.
     97
     98== Word Sketch Indexing ==
     99The word sketch indexing takes word sketch quintuples as input and stores them in an efficient indexing structure which is part of Manatee and is designed to enable fast execution of following queries:
     100
     101[[BR]]
     102 * list of all headwords in the database
     103
     104
     105 * for a given headword, what are the associated relations
     106
     107
     108 * for a given headword and relation, what are the associated collocations
     109
     110
     111 * for a given headword, relation and collocation, what are the corpus occurrences (positions) of headword and collocation
     112
     113
     114 * for a given headword, relation and collocation, what is the number of corpus occurrences of headword and collocation
     115
     116
     117 * for a given headword, relation and collocation, what is the associated score
     118
     119
     120
     121[[BR]]
     122== Word Sketch Scoring ==
     123On top of an indexed word sketch database, the statistical components of Manatee finally perform computation of the score for each word sketch triple. A number of lexicographic association measures is implemented in the scoring module (such as T-score, MI-score etc.) with the default being a modification of the Dice coefficient called the logDice association score (Rychlý, 2008) defined for a given headword w1, relation Rand collocation w2as:
     124
     125[[BR]]14 +log2 Dice w1; R; w2w1; R; *;w1; R; w2*; R; w2= 14 + log22 * w1; R; w2w1;R; *+*; R; w2
     126
     127[[BR]]At the end of the computation, the scores are stored in the word sketch database.
     128
     129== Implementation and tools ==
     130The word sketch machinery as described above is all part of Manatee libraries written in C++. From a user perspective, each of the modules is controlled with single binary executable file used from Linux command line:
     131
     132[[BR]]
     133 * word sketch generation: genws
     134
     135
     136 * word sketch indexing: mkwmap
     137
     138
     139 * word sketch scoring: mkwmrank
     140
     141
     142
     143[[BR]]In the following we describe the usage and interaction between these three tools.
     144
     145=== genws ===
     146genws takes four parameters:
     147
     148[[BR]]
     149 * the name of the corpus (indexed in Manatee databases)
     150
     151
     152 * the name of the corpus positional attribute which will be used for the lookup of headwords and collocates (usually a lemma, word form, or combination of lemma and part-of-speech abbreviation called a lempos)
     153
     154
     155 * the output data path
     156
     157
     158 * the word sketch grammar file (having .wsdef extension by convention)
     159
     160
     161
     162[[BR]]The tools sequentially processes the given word sketch grammar file, evaluates all queries and send word sketch quintuples to its standard output as binary 64 bit integers. Its output can be inspected with standard Unix tools such as od:
     163
     164[[BR]]genws bnc lemma /out/wsdb english.wsdef | od -t d8 -w40
     165
     166[[BR]]0000000     24            0           27           27           30
     167
     1680000050     27            0           24           30           27
     169
     1700000120     41            0           42           49           51
     171
     1720000170     42            0           41           51           49
     173
     1740000240     36            0           57           76           80
     175
     176![...]
     177
     178[[BR]]E.g. the first line contains a word sketch quintuple for headword with id 24, relation with id 0, collocation with id 27 occurring on positions 27 (headword) and 30 (collocation).
     179
     180=== mkwmap ===
     181The output of genws is normally directly passed to mkwmap which reads the word sketch quintuples from standard input and proceeds in three phases:
     182
     183[[BR]]
     184 1. sorting of the word sketch quintuples (numerically by headword, relation, and collocation) in a number of batches so that they fit into the computer memory
     185
     186
     187 1. joining of sorted batches
     188
     189
     190 1. writing the indices on disk
     191
     192
     193
     194[[BR]]mkwmap takes a single obligatory argument pointing to the same data path as was used in genws. An example call would be:
     195
     196[[BR]]genws bnc lemma /out/wsdb english.wsdef | mkwmap /out/wsdb
     197
     198=== mkwmrank ===
     199mkwmrank is called after a successful word sketch generation and indexation by genws and mkwmrank. It operates on top of an existing word sketch database indexed by mkwmrank and updates this database with scores that it first computes. It takes a single obligatory argument pointing to the output data path (i.e. same argument as mkwmrank).
     200
     201[[BR]]The whole processing pipeline would thus be:
     202
     203[[BR]]genws bnc lemma /out/wsdb english.wsdef | mkwmap /out/wsdb; mkwmrank /out/wsdb
     204
     205== Performance and Effectivity ==
     206The word sketch format has been under continuous development since 2004. Besides various extensions to the word sketch functionality itself, such as multi-word sketches (Kilgarriff, 2012) or bilingual sketches (Baisa, 2014), we have been focusing on improvements in terms of both time and space efficiency of the word sketch computation.
     207
     208=== Time complexity ===
     209The time necessary for compiling word sketch indices needs to be assessed according to the three independent phases (computation, indexing, scoring).
     210
     211[[BR]]The most consuming part if word sketch computation, which depends on the number of relations and complexity of the corpus queries. The indexation phase is not very CPU intensive, log-linear with regard to the number of quintuples due to the sorting, and in terms of speed mainly dependent on the speed of the underlying storage performing I/O operations. The scoring phase is very fast since it facilitates existing indices, and usually takes less than 0.5 % time of the overall computation.
     212
     213[[BR]]Figure 2: Trivial parallelization
     214
     215[[BR]]From this follows that clearly any efforts towards speeding up the computation should be devoted to the computation, i.e. query evaluation phase. In (Pomikálek, 2012) we have shown a parallelization approach which, depending on the structure of the word sketch relation, allows close to a linear speed up of this phase with regard to the number of processing cores used.
     216
     217[[BR]]In natural language processing often parallelization can be done in the most trivial way: by splitting data to be processed into N parts and running N independent tasks (see Figure 2). However, in case of a corpus management system, this is often not possible because of the underlying string-to-number mapping which needs to be consistent across a single corpus, and hence shared during parallel processing. As such, it easily represents a bottleneck severely limiting potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism which deals with this issue and is called corpus virtualization.
     218
     219Figure 3: Shared lexicon as a parallelization bottleneck
     220
     221=== Space complexity ===
     222In 2015 a major overhaul of the indexing format has been performed in order to improve space efficiency. In the following table we provide a comparison of disk space occupied by indices in the two most recent word sketch formats:
     223
     224[[BR]]|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
     225|| enTenTen13[[BR]](~22 billion tokens, Jakubíček et al., 2013) || ~370 GB || ~90 GB ||
     226
     227[[BR]]
     228== Qualitative Evaluation ==
     229Besides technical aspects of word sketches processing, a very important research question is the one of their quality. Since 2012 this has been subject to extensive research and is reported as a separate deliverable in D4.1: Methodology of Sketch Grammar evaluation.
     230
     231== Recent Extensions to the Word Sketches ==
     232In this section we describe recent enhancements to the basic word sketch functionality described so far.
     233
     234=== Multiword sketches ===
     235Originally word sketch headwords used to be always single word units -- depending on particular tokenization of the underlying corpus. In (Kilgarriff, 2012) the word sketch functionality has been extended towards multiword units. The approach taken is scalable in terms that the multiword unit length is not limited and only depends on the corpus evidence that can be found.
     236
     237[[BR]]Technically multiword sketches are “filtered sketches”, e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and filters the word sketches only for corpus positions where ''young'' a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate.
     238
     239[[BR]]The resulting multiword sketch is depicted in Figure 4. Note that this process can be repeated to so as to retrieve word sketch for e.g. ''handsome young man''.
     240
     241=== Bilingual word sketches ===
     242Another extension to the word sketch functionality relates to the application of word sketches on parallel corpora and bilingual texts in general, e.g. to perform a contrastive collocational analysis of a word and its translation equivalent. This can be done in “three flavours”, all of which are described in detail in (Baisa, 2014):
     243
     244[[BR]]
     245 1. Bilingual parallel word sketch (BIP)
     246
     247
     248
     249In case of two parallel corpora, i.e. corpora with existing alignment on sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on such alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages for individual collocations. An example parallel word sketch is provided in Figure 5.
     250
     251[[BR]][[BR]]Figure 5: Bilingual parallel word sketch
     252
     253[[BR]]
     254 2. Bilingual comparable word sketch (BIC)
     255
     256
     257
     258In case of two comparable corpora, i.e. corpora which are not parallel but roughly of same size and same text types, we can leverage the translation candidates computer for the same pair of languages on two parallel corpora as in the first case, and proceed similarly, except we cannot show any aligned segments obviously.
     259
     260[[BR]]
     261 3. Bilingual manual word sketch (BIM)
     262
     263
     264
     265In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.
     266
     267[[BR]]Figure 4: Multiword sketch for ''young man''
     268
     269[[BR]]
     270=== Longest-commonest match ===
     271In some cases it is not very clear how the collocation is used with the headword. In simple cases where the collocation is an adjective and the headword is a noun, the whole phrase is immediately clear (''young man''), however in many others it is much more difficult: consider headword ''resource'' and collocation ''management'' in the relation ''modifies''. Reviewing all the concordance lines actually reveals that the most common (hence longest-commonest) match is ''human resources management''.
     272
     273[[BR]]To spare users from inspecting concordance in these cases, recently the word sketch format has been enhanced so that it contains what we call longest-commonest match: the most common phrase built from the headword and collocation in the given grammatical relation. The algorithm for building longest-commonest match has been described in detail in (Kilgarriff, 2015).
     274
     275== References ==
     276BAISA, Vít, et al. Bilingual word sketches: the translate button. In: ''Proc EURALEX''. 2014. p. 505-513.
     277
     278HORÁK, Aleš; RYCHLÝ, Pavel; KILGARRIFF, Adam. Czech word sketch relations with full syntax parser.''After Half a Century of Slavonic Natural Language Processing'', 2009, 101-112.
     279
     280JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. !In:PACLIC. 2010. p. 741-747.
     281
     282JAKUBÍČEK, Miloš, et al. The tenten corpus family. In: ''7th International Corpus Linguistics Conference CL''. 2013. p. 125-127.
     283
     284JAKUBÍCEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: ''Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme''. p. 7.
     285
     286KERNIGHAN, Brian W.; RITCHIE, Dennis M. The M4 macro processor. Bell Laboratories, 1977.KERNIGHAN, Brian W.; RITCHIE, Dennis M. The M4 macro processor. Bell Laboratories, 1977.
     287
     288KILGARRIFF, Adam, et al. Itri-04-08 the sketch engine. ''Information Technology'', 2004, 105: 116.
     289
     290KILGARRIFF, Adam, et al. Finding multiwords of more than two words. ''Proceedings of EURALEX2012'', 2012.
     291
     292KILGARRIFF, Adam, et al. The Sketch Engine: ten years on. Lexicography, 2014, 1.1: 7-36.
     293
     294KILGARRIFF, Adam, et al. Extrinsic corpus evaluation with a collocation dictionary task. In:''Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)''. 2014.
     295
     296KILGARRIFF, Adam, et al. Longest–commonest Match. 2015.
     297
     298POMIKÁLEK, Jan; JAKUBÍCEK, Milos; RYCHLÝ, Pavel. Building a 70 billion word corpus of English from !ClueWeb. In: ''LREC''. 2012. p. 502-506.
     299
     300RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace. 2000.
     301
     302RYCHLÝ, Pavel. Manatee/bonito-a modular corpus manager. In: ''1st Workshop on Recent Advances in Slavonic Natural Language Processing''. within MU: Faculty of Informatics Further information, 2007. p. 65-70.
     303
     304RYCHLÝ, Pavel. A lexicographer-friendly association score. ''Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN'', 2008, 6-9.