Changes between Version 2 and Version 3 of WordSketchGrammars


Timestamp:
Jan 17, 2016, 10:18:34 PM (9 years ago)
Author:
xkocinc
Comment:

--

  • WordSketchGrammars

    v2 v3  
    11= D1.1.3 Specification of word-sketch grammars and tools =
    22== The concept of a Word Sketch == #docs-internal-guid-33ee83b6-447a-e7a2-1eff-7d803eb0fa06
    3 Word sketches are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and are a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). They were introduced as a practical concept in computer lexicography to help lexicographers learn how words behave in particular grammatical relations. They give a quick response to basic questions like 
     3Word sketches are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and are a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). They were introduced as a practical concept in computer lexicography to help lexicographers learn how words behave in particular grammatical relations. They give a quick response to basic questions like
    44
    55[[BR]]''What are the most salient subjects of a given verb?''
     
    1111[[BR]]The key properties of a word sketch, as given in the definition are:
    1212
    13 [[BR]]
    1413 * they are data-driven, i.e. corpus-based, with links back to the underlying corpus evidence, i.e. allowing inspection of particular concordance lines which were the source for the computation
    1514
    16 
    1715 * they are produced in a fully automatic way, based on a hybrid rule-based and statistical approach (as explained further)
    1816
    19 
    2017 * they are clustered into meaningful language- and task-dependent relations, usually corresponding to the syntactic functions of the collocates
    2118
    22 
    2319 * they are easy to read, in a tabular format, with every column representing one grammatical relation (see Figure 1)[[BR]]
    2420
    25 
    26 
    2721Figure 1: Example word sketch table from SkELL
    2822
    2923[[BR]]From a syntax theory point of view, word sketches represent an instance of the dependency syntax formalism allowing ambiguity, i.e. forming a dependency multigraph instead of a dependency tree. Similarly to a classical dependency relation, a word sketch relation is a binary relation, which:
    3024
    31 [[BR]]
    3225 * might be reflexive (reflexive relations are called “constructions”)
    3326
    34 
    3527 * might be symmetric
    3628
    37 
    3829 * is not transitive
    3930
    40 
    4131 * is labeled (i.e. named)
    4232
    43 
    4433 * is scored
    45 
    46 
    4734
    4835[[BR]]Given that the relation is labeled, items of the relation are usually referred to as word sketch triples consisting of a headword, a relation name and a collocation, such as (resource; modifier; scarce). Each triple is also assigned a score (a floating point number) representing the strength of the association between headword and collocation; its exact semantics depend on the association measure used to compute it, as explained later.
     
    6047The word sketch computation module in Manatee is based on the concept of a word sketch grammar, a plain text file specifying multiple word sketch relations as regular expressions over corpus positions, usually exploiting the morphological annotation of the corpus. Definitions of word sketch relations are standard corpus queries written in the Corpus Query Language (Jakubíček, 2010), which is used in Manatee. An example of a simple word sketch relation definition is given below:
    6148
    62 [[BR]]|| =modifier[[BR]][[BR]] 2:[tag=”ADJ”] 1:[tag=”NOUN”][[BR]][[BR]] 2:[word=”.*ing”] 1:[tag=”NOUN”] ||
     49|| =modifier[[BR]][[BR]] 2:[tag=”ADJ”] 1:[tag=”NOUN”][[BR]][[BR]] 2:[word=”.*ing”] 1:[tag=”NOUN”] ||
    6350
    6451[[BR]]In this example, the modifier relation is defined using two CQL expressions (combined as a logical OR) as any adjective or -ing form preceding a noun. In a CQL expression, the headword is always prefixed with a “1:” label and the collocation with a “2:” label. In the word sketch computation module, the queries for all relations are executed over a corpus already indexed in Manatee, yielding the word sketch quintuples, which are then indexed.
     
    6855 * STRUCTLIMIT - if set to a corpus structure such as paragraph or sentence, all query matching following this directive throughout the whole grammar will stop at structure boundaries.
    6956
    70 
    7157 * DEFAULTATTR - specifies the default corpus attribute for all following queries in the sketch grammar.
    7258
    73 
    7459 * DUAL - the name of the relation following this directive is split on a slash character, where the first part of the name corresponds to the relation as given, and the second part names a dual relation with headword and collocation interchanged.
    7560
    76 
    7761 * SYMMETRIC - assumes that the relation following this directive is symmetric. In other words, SYMMETRIC is an instance of DUAL where both relations have the same name.
    7862
    79 
    8063 * COLLOC - this directive allows forming the collocation string from multiple query parts (labeled with numbers, as seen in the example).
    8164
    82 
    83 
    8465[[BR]]The following example sketch grammar uses all the directives listed above:
    8566
    86 [[BR]]|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]]  2:"ADJ"  "AD[JV]"{0,3} 1:"N"[[BR]][[BR]]  1:"N"  "ADJ"? 2:"ADJ"[[BR]][[BR]]  1:"V.*" 2:"ADV"[[BR]][[BR]]  2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]]  1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]]  1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]]  1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]]  1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]]  1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||
     67|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]]  2:"ADJ"  "AD[JV]"{0,3} 1:"N"[[BR]][[BR]]  1:"N"  "ADJ"? 2:"ADJ"[[BR]][[BR]]  1:"V.*" 2:"ADV"[[BR]][[BR]]  2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]]  1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]]  1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]]  1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]]  1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]]  1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||
    8768
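The effect of the DUAL and SYMMETRIC directives on a single match can be illustrated with a minimal Python sketch. It is not part of Manatee; the function name `expand_match` and the way the directive is passed in are assumptions made for illustration only.

```python
def expand_match(headword, relation, collocation, directive=None):
    """Return the word sketch triples produced by one query match.

    Hypothetical helper: 'directive' is the directive preceding the
    relation name in the grammar ("DUAL", "SYMMETRIC" or None).
    """
    if directive == "DUAL":
        # "modifier/modifies" -> relation as given + dual relation
        name, dual_name = relation.split("/", 1)
        return [(headword, name, collocation),
                (collocation, dual_name, headword)]
    if directive == "SYMMETRIC":
        # Same as DUAL, but both directions share one name
        return [(headword, relation, collocation),
                (collocation, relation, headword)]
    return [(headword, relation, collocation)]


print(expand_match("resource", "modifier/modifies", "scarce", "DUAL"))
print(expand_match("cat", "and_or", "dog", "SYMMETRIC"))
```

With DUAL, the headword and collocation swap places under the second name; with SYMMETRIC, they swap places under the same name.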
    8869[[BR]][[BR]]Full documentation of the sketch grammar formalism is available online.
     
    9071[[BR]]So far sketch grammars for 26 languages have been devised during Sketch Engine development.
    9172
    92 [[BR]]
    9373=== Alternative Approaches to Word Sketch Computation ===
    9474As noted earlier, the word sketch generation module in Manatee can easily be replaced by any other tool for generating dependency triples. In particular, any existing dependency parser can be used to generate word sketches, as we have shown in (Kilgarriff, 2014). Moreover, since word sketches do not require an unambiguous representation of a text or sentence, they can also be generated from dependency graphs, as was demonstrated in (Horák, 2009).
     
    9979The word sketch indexing takes word sketch quintuples as input and stores them in an efficient indexing structure which is part of Manatee and is designed to enable fast execution of the following queries:
    10080
    101 [[BR]]
    10281 * list of all headwords in the database
    10382
    104 
    10583 * for a given headword, what are the associated relations
    10684
    107 
    10885 * for a given headword and relation, what are the associated collocations
    10986
    110 
    11187 * for a given headword, relation and collocation, what are the corpus occurrences (positions) of headword and collocation
    11288
    113 
    11489 * for a given headword, relation and collocation, what is the number of corpus occurrences of headword and collocation
    11590
    116 
    11791 * for a given headword, relation and collocation, what is the associated score
    11892
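The queries listed above can be illustrated with a toy in-memory structure in Python. This is only a sketch of the query interface, not Manatee's on-disk format; the class name `SketchIndex`, its method names, and the example positions and score are all hypothetical.

```python
from collections import defaultdict


class SketchIndex:
    """Toy stand-in for the word sketch index (not Manatee's format)."""

    def __init__(self):
        # headword -> relation -> collocation -> [(head_pos, colloc_pos)]
        self._hits = defaultdict(
            lambda: defaultdict(lambda: defaultdict(list)))
        self._scores = {}

    def add(self, head, rel, coll, head_pos, coll_pos):
        self._hits[head][rel][coll].append((head_pos, coll_pos))

    def set_score(self, head, rel, coll, score):
        self._scores[(head, rel, coll)] = score

    def headwords(self):
        return sorted(self._hits)

    def relations(self, head):
        return sorted(self._hits.get(head, {}))

    def collocations(self, head, rel):
        return sorted(self._hits.get(head, {}).get(rel, {}))

    def positions(self, head, rel, coll):
        return list(self._hits.get(head, {}).get(rel, {}).get(coll, []))

    def count(self, head, rel, coll):
        return len(self.positions(head, rel, coll))

    def score(self, head, rel, coll):
        return self._scores.get((head, rel, coll))


index = SketchIndex()
index.add("resource", "modifier", "scarce", 30, 29)   # made-up positions
index.add("resource", "modifier", "scarce", 80, 79)
index.set_score("resource", "modifier", "scarce", 9.5)  # made-up score
print(index.collocations("resource", "modifier"))
print(index.count("resource", "modifier", "scarce"))
```

Each method corresponds to one query in the list; the lookups use `.get` so that asking about an unknown headword does not create empty entries.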
    119 
    120 
    121 [[BR]]
    12293== Word Sketch Scoring ==
    12394On top of an indexed word sketch database, the statistical components of Manatee compute the score for each word sketch triple. A number of lexicographic association measures are implemented in the scoring module (such as T-score, MI-score etc.), with the default being a modification of the Dice coefficient called the logDice association score (Rychlý, 2008), defined for a given headword w1, relation R and collocation w2 as:
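A minimal Python sketch of the logDice computation, assuming the definition published in (Rychlý, 2008) as applied to word sketch triples: 14 + log2(2 · f(w1, R, w2) / (f(w1, R, *) + f(*, *, w2))). The function and parameter names are illustrative and do not come from Manatee.

```python
import math


def log_dice(f_triple, f_head_rel, f_colloc):
    """logDice score of a word sketch triple (w1, R, w2).

    f_triple   -- frequency of the triple (w1, R, w2)
    f_head_rel -- frequency of w1 in relation R with any collocation
    f_colloc   -- frequency of w2 as a collocation in any relation
    """
    return 14 + math.log2(2 * f_triple / (f_head_rel + f_colloc))


# Made-up frequencies, for illustration only
print(log_dice(1, 3, 5))
```

Under this definition the theoretical maximum is 14, reached when the headword and collocation occur only with each other in the given relation.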
     
    130101The word sketch machinery described above is part of the Manatee libraries, written in C++. From a user perspective, each of the modules is controlled by a single binary executable run from the Linux command line:
    131102
    132 [[BR]]
    133103 * word sketch generation: genws
    134104
    135 
    136105 * word sketch indexing: mkwmap
    137106
    138 
    139107 * word sketch scoring: mkwmrank
    140 
    141 
    142108
    143109[[BR]]In the following we describe the usage and interaction between these three tools.
     
    146112genws takes four parameters:
    147113
    148 [[BR]]
    149114 * the name of the corpus (indexed in Manatee databases)
    150115
    151 
    152116 * the name of the corpus positional attribute which will be used for the lookup of headwords and collocates (usually a lemma, word form, or combination of lemma and part-of-speech abbreviation called a lempos)
    153117
    154 
    155118 * the output data path
    156119
    157 
    158120 * the word sketch grammar file (having .wsdef extension by convention)
    159121
    160 
    161 
    162122[[BR]]The tool sequentially processes the given word sketch grammar file, evaluates all queries and sends word sketch quintuples to its standard output as binary 64-bit integers. Its output can be inspected with standard Unix tools such as od:
    163123
    164124[[BR]]genws bnc lemma /out/wsdb english.wsdef | od -t d8 -w40
    165125
    166 [[BR]]0000000     24            0           27           27           30
    167 
    168 0000050     27            0           24           30           27
    169 
    170 0000120     41            0           42           49           51
    171 
    172 0000170     42            0           41           51           49
    173 
    174 0000240     36            0           57           76           80
     126[[BR]]0000000     24            0           27           27           30
     127
     1280000050     27            0           24           30           27
     129
     1300000120     41            0           42           49           51
     131
     1320000170     42            0           41           51           49
     133
     1340000240     36            0           57           76           80
    175135
    176136![...]
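The same stream can also be read programmatically. The following Python sketch assumes only what is stated above: the output is a sequence of records of five signed 64-bit integers in the machine's native byte order (which is also what od -t d8 displays). The meaning of the individual fields is not assumed here, and the function name `read_quintuples` is hypothetical.

```python
import io
import struct

# Five native-endian signed 64-bit integers = 40 bytes per record
RECORD = struct.Struct("=5q")


def read_quintuples(stream):
    """Yield 5-tuples of integers from a binary stream of quintuples."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of stream (or truncated trailing record)
        yield RECORD.unpack(chunk)


# Demonstration on bytes built from the first two rows shown above;
# with real data the stream would be the standard output of genws.
sample = RECORD.pack(24, 0, 27, 27, 30) + RECORD.pack(27, 0, 24, 30, 27)
for quintuple in read_quintuples(io.BytesIO(sample)):
    print(quintuple)
```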
     
    181141The output of genws is normally directly passed to mkwmap which reads the word sketch quintuples from standard input and proceeds in three phases:
    182142
    183 [[BR]]
    184143 1. sorting of the word sketch quintuples (numerically by headword, relation, and collocation) in a number of batches so that they fit into the computer memory
    185144
    186 
    187145 1. joining of sorted batches
    188146
    189 
    190147 1. writing the indices on disk
    191 
    192 
    193148
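The first two phases follow the usual external merge sort pattern, which can be sketched in Python as follows. This is an illustration of the technique only, not the mkwmap implementation: the function names are hypothetical, and the sorted batches are kept in memory here, whereas a real tool would spill each batch to temporary storage before merging.

```python
import heapq


def sorted_batches(quintuples, batch_size):
    """Phase 1: cut the input into batches of bounded size and sort each."""
    batch = []
    for quintuple in quintuples:
        batch.append(quintuple)
        if len(batch) >= batch_size:
            yield sorted(batch)
            batch = []
    if batch:
        yield sorted(batch)


def merge_batches(batches):
    """Phase 2: lazily merge already sorted batches into one sorted stream."""
    return heapq.merge(*batches)


# Rows taken from the od output shown earlier
rows = [
    (24, 0, 27, 27, 30),
    (27, 0, 24, 30, 27),
    (41, 0, 42, 49, 51),
    (42, 0, 41, 51, 49),
    (36, 0, 57, 76, 80),
]
for row in merge_batches(sorted_batches(rows, batch_size=2)):
    print(row)  # phase 3 would write the index entries here
```

Tuples compare element by element, so the first three fields act as the primary sort key, in line with the numeric ordering described in phase 1.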
    194149[[BR]]mkwmap takes a single obligatory argument pointing to the same data path as was used in genws. An example call would be:
     
    217172[[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because of the underlying string-to-number mapping, which needs to be consistent across a single corpus and hence shared during parallel processing. As such, it easily becomes a bottleneck that severely limits the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism, called corpus virtualization, which deals with this issue.
    218173
    219 Figure 3: Shared lexicon as a parallelization bottleneck 
     174Figure 3: Shared lexicon as a parallelization bottleneck
    220175
    221176=== Space complexity ===
    222177In 2015 a major overhaul of the indexing format was performed in order to improve space efficiency. The following table compares the disk space occupied by indices in the two most recent word sketch formats:
    223178
    224 [[BR]]|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
     179|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
    225180|| enTenTen13[[BR]](~22 billion tokens, Jakubíček et al., 2013) || ~370 GB || ~90 GB ||
    226181
    227 [[BR]]
    228182== Qualitative Evaluation ==
    229183Besides the technical aspects of word sketch processing, a very important research question is that of their quality. Since 2012 this has been the subject of extensive research, which is reported in a separate deliverable, D4.1: Methodology of Sketch Grammar evaluation.
     
    235189Originally, word sketch headwords were always single-word units, depending on the particular tokenization of the underlying corpus. In (Kilgarriff, 2012) the word sketch functionality was extended towards multiword units. The approach taken is scalable in the sense that the multiword unit length is not limited and depends only on the corpus evidence that can be found.
    236190
    237 [[BR]]Technically, multiword sketches are “filtered sketches”: e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and restricts the word sketches to corpus positions where ''young'' is a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate. 
     191[[BR]]Technically, multiword sketches are “filtered sketches”: e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and restricts the word sketches to corpus positions where ''young'' is a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate.
    238192
    239193[[BR]]The resulting multiword sketch is depicted in Figure 4. Note that this process can be repeated so as to retrieve a word sketch for e.g. ''handsome young man''.
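The filtering step can be sketched in Python under simplifying assumptions: each word sketch hit is represented as a tuple (headword, relation, collocation, headword position, collocation position), and the hits are held in a plain list. The function name `multiword_sketch` and the example data are hypothetical, not part of Sketch Engine.

```python
def multiword_sketch(hits, headword, collocate):
    """Keep hits of 'headword' only at positions where 'collocate'
    is one of its collocates (in any relation)."""
    anchor_positions = {
        head_pos
        for head, _rel, coll, head_pos, _coll_pos in hits
        if head == headword and coll == collocate
    }
    return [
        hit for hit in hits
        if hit[0] == headword and hit[3] in anchor_positions
    ]


# Made-up hits: "man" occurs at corpus positions 10 and 20
hits = [
    ("man", "modifier", "young", 10, 9),
    ("man", "subject_of", "walk", 10, 11),
    ("man", "modifier", "old", 20, 19),
    ("man", "subject_of", "run", 20, 21),
]
for hit in multiword_sketch(hits, "man", "young"):
    print(hit)
```

Applying the same function to the filtered result with a further collocate corresponds to the repeated step for longer units such as ''handsome young man''.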
     
    242196Another extension to the word sketch functionality relates to the application of word sketches on parallel corpora and bilingual texts in general, e.g. to perform a contrastive collocational analysis of a word and its translation equivalent. This can be done in “three flavours”, all of which are described in detail in (Baisa, 2014):
    243197
    244 [[BR]]
    245198 1. Bilingual parallel word sketch (BIP)
    246199
    247 
    248 
    249200In the case of two parallel corpora, i.e. corpora with an existing alignment on the sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on this alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages for individual collocations. An example parallel word sketch is provided in Figure 5.
    250201
    251202[[BR]][[BR]]Figure 5: Bilingual parallel word sketch
    252203
    253 [[BR]]
    254204 2. Bilingual comparable word sketch (BIC)
    255205
    256 
    257 
    258206In the case of two comparable corpora, i.e. corpora which are not parallel but are of roughly the same size and text types, we can reuse the translation candidates computed for the same pair of languages on two parallel corpora as in the first case, and proceed similarly, except that no aligned segments can be shown.
    259207
    260 [[BR]]
    261208 3. Bilingual manual word sketch (BIM)
    262209
    263 
    264 
    265210In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.
    266211
    267212[[BR]]Figure 4: Multiword sketch for ''young man''
    268213
    269 [[BR]]
    270214=== Longest-commonest match ===
    271215In some cases it is not clear how the collocation is used with the headword. In simple cases where the collocation is an adjective and the headword is a noun, the whole phrase is immediately clear (''young man''). In many others it is much more difficult: consider the headword ''resource'' and the collocation ''management'' in the relation ''modifies''. Reviewing all the concordance lines reveals that the most common (hence longest-commonest) match is ''human resources management''.