Changes between Version 2 and Version 3 of WordSketchGrammars
- Timestamp: Jan 17, 2016, 10:18:34 PM
WordSketchGrammars
= D1.1.3 Specification of word-sketch grammars and tools =
== The concept of a Word Sketch == #docs-internal-guid-33ee83b6-447a-e7a2-1eff-7d803eb0fa06
Word sketches are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). They were introduced as a practical concept in computer lexicography to help lexicographers learn how words behave in particular grammatical relations. They quickly answer basic questions such as:

[[BR]]''What are the most salient subjects of a given verb?''

…

[[BR]]The key properties of a word sketch, as given in the definition, are:

* they are data-driven, i.e. corpus-based, with links back to the underlying corpus evidence,
allowing inspection of the particular concordance lines from which they were computed

* they are produced in a fully automatic way, based on a hybrid rule-based and statistical approach (as explained further below)

* they are clustered into meaningful language- and task-dependent relations, usually corresponding to the syntactic functions of the collocates

* they are easy to read, in a tabular format, with every column representing one grammatical relation (see Figure 1)[[BR]]

Figure 1: Example word sketch table from SkELL

[[BR]]From the point of view of syntactic theory, word sketches are an instance of the dependency syntax formalism that allows ambiguity, i.e. they form a dependency multigraph instead of a dependency tree. Like a classical dependency relation, a word sketch relation is a binary relation, which:

* might be reflexive (reflexive relations are called “constructions”)

* might be symmetric

* is not transitive

* is labeled (i.e. named)

* is scored

[[BR]]Given that the relation is labeled, its items are usually referred to as word sketch triples consisting of a headword, a relation name and a collocation, such as (resource; modifier; scarce). Each triple is also assigned a score (a floating point number) representing the strength of the association between headword and collocation; its exact semantics depends on the association measure used to compute the score, as explained later.

…

The word sketch computation module in Manatee is based on the concept of a word sketch grammar, a plain text file specifying multiple word sketch relations as regular expressions over corpus positions, usually exploiting the morphological annotation of the corpus.
Definitions of word sketch relations are standard corpus queries written in the Corpus Query Language (CQL; Jakubíček, 2010), which is used in Manatee. An example of a simple word sketch relation definition is given below:

|| =modifier[[BR]][[BR]] 2:[tag="ADJ"] 1:[tag="NOUN"][[BR]][[BR]] 2:[word=".*ing"] 1:[tag="NOUN"] ||

[[BR]]In this example, the modifier relation is defined by two CQL expressions (combined with logical OR) as any adjective or -ing form preceding a noun. In a CQL expression, the headword is always prefixed with a “1:” label and the collocation with a “2:” label. In the word sketch computation module, the queries for all relations are executed over a corpus already indexed in Manatee, yielding the word sketch quintuples, which are then indexed.

…

* STRUCTLIMIT - if set to a corpus structure such as paragraph or sentence, matches of all queries following this directive, throughout the rest of the grammar, stop at the boundaries of that structure.

* DEFAULTATTR - specifies the default corpus attribute for all following queries in the sketch grammar.

* DUAL - the name of the relation following this directive is split on the slash character: the first part names the relation as given, and the second part names a dual relation with headword and collocation interchanged.

* SYMMETRIC - declares the relation following this directive to be symmetric. In other words, SYMMETRIC is an instance of DUAL where both relations have the same name.

* COLLOC - this directive allows forming the collocation string from multiple query parts (labeled with numbers, as seen in the example).
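The effect of the DUAL and SYMMETRIC directives described above can be illustrated with a minimal Python sketch. It is not part of Manatee: the `Triple` type and the `expand_matches` function are hypothetical names introduced here for illustration only, and the sketch assumes that a query has already produced (headword, collocation) pairs.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """A word sketch triple (hypothetical helper type, not a Manatee API)."""
    headword: str
    relation: str
    collocation: str


def expand_matches(
    matches: Iterable[Tuple[str, str]],
    relation_name: str,
    directive: Optional[str] = None,
) -> Iterator[Triple]:
    """Turn (headword, collocation) matches into triples.

    directive is None, "DUAL" or "SYMMETRIC", mirroring the directives
    described in the text; everything else here is an assumption.
    """
    if directive == "DUAL":
        # The relation name is split on the slash: "modifier/modifies".
        primary, dual = relation_name.split("/", 1)
    elif directive == "SYMMETRIC":
        # SYMMETRIC behaves like DUAL with the same name on both sides.
        primary = dual = relation_name
    else:
        primary, dual = relation_name, None

    for headword, collocation in matches:
        yield Triple(headword, primary, collocation)
        if dual is not None:
            # Dual relation: headword and collocation interchanged.
            yield Triple(collocation, dual, headword)


if __name__ == "__main__":
    for triple in expand_matches([("resource", "scarce")], "modifier/modifies", "DUAL"):
        print(triple)
```

With the DUAL directive, a single match for (resource; modifier; scarce) therefore also yields the interchanged triple under the name ''modifies''.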
[[BR]]The following example sketch grammar uses all the directives listed above:

|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]] 2:"ADJ" "AD[JV]"{0,3} 1:"N"[[BR]][[BR]] 1:"N" "ADJ"? 2:"ADJ"[[BR]][[BR]] 1:"V.*" 2:"ADV"[[BR]][[BR]] 2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]] 1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]] 1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]] 1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]] 1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]] 1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||

[[BR]][[BR]]Full documentation of the sketch grammar formalism is available online.

…

[[BR]]So far, sketch grammars for 26 languages have been devised during Sketch Engine development.

=== Alternative Approaches to Word Sketch Computation ===
As noted earlier, the word sketch generation module in Manatee can easily be replaced by any other tool for generating dependency triples.
In particular, any existing dependency parser can be used to generate word sketches, as we have shown in (Kilgarriff, 2014). Moreover, since word sketches do not require an unambiguous representation of a text or sentence, they can also be generated from dependency graphs, as was demonstrated in (Horák, 2009).

…

Word sketch indexing takes word sketch quintuples as input and stores them in an efficient indexing structure which is part of Manatee and is designed to enable fast execution of the following queries:

* list of all headwords in the database

* for a given headword, what are the associated relations

* for a given headword and relation, what are the associated collocations

* for a given headword, relation and collocation, what are the corpus occurrences (positions) of headword and collocation

* for a given headword, relation and collocation, what is the number of corpus occurrences of headword and collocation

* for a given headword, relation and collocation, what is the associated score

== Word Sketch Scoring ==
On top of an indexed word sketch database, the statistical components of Manatee finally compute the score for each word sketch triple. A number of lexicographic association measures are implemented in the scoring module (such as T-score, MI-score etc.), the default being a modification of the Dice coefficient called the logDice association score (Rychlý, 2008), defined for a given headword w1, relation R and collocation w2 as:

…

The word sketch machinery described above is all part of the Manatee libraries written in C++.
From a user perspective, each of the modules is controlled by a single binary executable run from the Linux command line:

* word sketch generation: genws

* word sketch indexing: mkwmap

* word sketch scoring: mkwmrank

[[BR]]In the following we describe the usage of these three tools and the interaction between them.

…

genws takes four parameters:

* the name of the corpus (indexed in Manatee databases)

* the name of the corpus positional attribute which will be used for the lookup of headwords and collocates (usually a lemma, a word form, or a combination of lemma and part-of-speech abbreviation called a lempos)

* the output data path

* the word sketch grammar file (having the .wsdef extension by convention)

[[BR]]The tool sequentially processes the given word sketch grammar file, evaluates all queries and sends word sketch quintuples to its standard output as binary 64-bit integers. Its output can be inspected with standard Unix tools such as od:

[[BR]]genws bnc lemma /out/wsdb english.wsdef | od -t d8 -w40

[[BR]]0000000 24 0 27 27 30

0000050 27 0 24 30 27

0000120 41 0 42 49 51

0000170 42 0 41 51 49

0000240 36 0 57 76 80

![...]

…

The output of genws is normally passed directly to mkwmap, which reads the word sketch quintuples from standard input and proceeds in three phases:

1. sorting of the word sketch quintuples (numerically by headword, relation, and collocation) in a number of batches so that they fit into the computer memory

1. joining of sorted batches

1. writing the indices on disk

[[BR]]mkwmap takes a single obligatory argument pointing to the same data path as was used in genws. An example call would be:

…

[[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because of the underlying string-to-number mapping, which needs to be consistent across a single corpus and hence shared during parallel processing. As such, it easily becomes a bottleneck severely limiting the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism which deals with this issue and is called corpus virtualization.

Figure 3: Shared lexicon as a parallelization bottleneck

=== Space complexity ===
In 2015, a major overhaul of the indexing format was performed in order to improve space efficiency. The following table compares the disk space occupied by indices in the two most recent word sketch formats:

|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
|| enTenTen13[[BR]](~22 billion tokens, Jakubíček et al., 2013) || ~370 GB || ~90 GB ||

== Qualitative Evaluation ==
Besides the technical aspects of word sketch processing, a very important research question is that of their quality. Since 2012 this has been the subject of extensive research, which is reported as a separate deliverable, D4.1: Methodology of Sketch Grammar evaluation.
…

Originally, word sketch headwords were always single-word units, depending on the particular tokenization of the underlying corpus. In (Kilgarriff, 2012) the word sketch functionality was extended to multiword units. The approach is scalable in that the multiword unit length is not limited and depends only on the corpus evidence that can be found.

[[BR]]Technically, multiword sketches are “filtered sketches”: e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and filters the word sketches only for corpus positions where ''young'' is a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate.

[[BR]]The resulting multiword sketch is depicted in Figure 4. Note that this process can be repeated so as to retrieve a word sketch for e.g. ''handsome young man''.

…

Another extension to the word sketch functionality relates to the application of word sketches to parallel corpora and bilingual texts in general, e.g. to perform a contrastive collocational analysis of a word and its translation equivalent. This can be done in “three flavours”, all of which are described in detail in (Baisa, 2014):

1. Bilingual parallel word sketch (BIP)

In the case of two parallel corpora, i.e. corpora with an existing alignment on the sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on such alignment.
We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages of individual collocations. An example parallel word sketch is provided in Figure 5.

[[BR]][[BR]]Figure 5: Bilingual parallel word sketch

2. Bilingual comparable word sketch (BIC)

In the case of two comparable corpora, i.e. corpora which are not parallel but are roughly of the same size and contain the same text types, we can leverage the translation candidates computed for the same pair of languages on two parallel corpora as in the first case, and proceed similarly, except that we cannot show any aligned segments.

3. Bilingual manual word sketch (BIM)

In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.

[[BR]]Figure 4: Multiword sketch for ''young man''

=== Longest-commonest match ===
In some cases it is not very clear how the collocation is used with the headword. In simple cases where the collocation is an adjective and the headword is a noun, the whole phrase is immediately clear (''young man''); in many others, however, it is much more difficult: consider the headword ''resource'' and the collocation ''management'' in the relation ''modifies''. Reviewing all the concordance lines reveals that the most common (hence longest-commonest) match is ''human resources management''.
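One possible way to compute such a match is sketched below in Python. This is an illustration under stated assumptions, not the Sketch Engine implementation: the function name `longest_commonest_match`, the `Hit` representation (a tokenized concordance line plus the inclusive span covering headword and collocation) and the `min_share` threshold are all hypothetical. The sketch first picks the most common surface string covering headword and collocation, then extends it by one neighbouring token at a time for as long as the extended string still accounts for more than `min_share` of all hits.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# (tokens of a concordance line, start index, end index), end inclusive.
# The span is assumed to cover both the headword and the collocation.
Hit = Tuple[Sequence[str], int, int]


def longest_commonest_match(hits: List[Hit], min_share: float = 0.25) -> str:
    """Return an illustrative longest-commonest match for the given hits.

    min_share is an assumed parameter: an extension is kept only if the
    extended string covers more than this share of all hits.
    """
    total = len(hits)
    if total == 0:
        return ""

    # Step 1: the commonest surface form of the headword-collocation span.
    spans = Counter(tuple(t[s:e + 1]) for t, s, e in hits)
    core, _ = spans.most_common(1)[0]
    current = [(t, s, e) for t, s, e in hits if tuple(t[s:e + 1]) == core]

    # Step 2: greedily extend the match to the left or to the right.
    while True:
        left = Counter(t[s - 1] for t, s, e in current if s > 0)
        right = Counter(t[e + 1] for t, s, e in current if e + 1 < len(t))
        candidates = []
        if left:
            word, count = left.most_common(1)[0]
            candidates.append((count, "L", word))
        if right:
            word, count = right.most_common(1)[0]
            candidates.append((count, "R", word))
        if not candidates:
            break
        count, side, word = max(candidates)
        if count / total <= min_share:
            break
        if side == "L":
            current = [(t, s - 1, e) for t, s, e in current
                       if s > 0 and t[s - 1] == word]
        else:
            current = [(t, s, e + 1) for t, s, e in current
                       if e + 1 < len(t) and t[e + 1] == word]

    tokens, start, end = current[0]
    return " ".join(tokens[start:end + 1])


# Invented example hits for headword "resource" and collocation "management".
example_hits: List[Hit] = [
    (["the", "human", "resources", "management", "team"], 2, 3),
    (["human", "resources", "management", "is", "hard"], 1, 2),
    (["our", "human", "resources", "management", "policy"], 2, 3),
    (["water", "resource", "management", "plan"], 1, 2),
    (["human", "resources", "management"], 1, 2),
]

if __name__ == "__main__":
    print(longest_commonest_match(example_hits))
```

The example hits are made up for illustration; on real data the hits would come from the concordance lines behind the word sketch triple.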