Changes between Version 2 and Version 3 of WordSketchGrammars

Timestamp:: Jan 17, 2016, 10:18:34 PM (9 years ago)
Author:: xkocinc
Comment:: --

WordSketchGrammars

-                      v2
+                      v3
 = D1.1.3 Specification of word-sketch grammars and tools =
 == The concept of a Word Sketch == #docs-internal-guid-33ee83b6-447a-e7a2-1eff-7d803eb0fa06
 Word sketches are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and are a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). They were introduced as a practical concept in computer lexicography to help lexicographers learn how words behave in particular grammatical relations. They give a quick response to basic questions like
+Word sketches are “one-page automatic, corpus-based summaries of a word’s grammatical and collocational behaviour” (Kilgarriff, 2004) and are a key feature of the Sketch Engine corpus management system (Kilgarriff, 2004, 2014). They were introduced as a practical concept in computer lexicography to help lexicographers learn how words behave in particular grammatical relations. They give a quick response to basic questions like
 [[BR]]''What are the most salient subjects of a given verb?''
 …
 [[BR]]The key properties of a word sketch, as given in the definition are:
-[[BR]]
  * they are data-driven, i.e. corpus-based, with links back to the underlying corpus evidence, i.e. allowing inspection of particular concordance lines which were the source for the computation
  * they are produced in a fully automatic way, based on a hybrid rule-based and statistical approach (as explained further)
  * they are clustered into meaningful language- and task-dependent relations, usually corresponding to the syntactic functions of the collocates
  * they are easy to read, in a tabular format, with every column representing one grammatical relation (see Figure 1)[[BR]]
 Figure 1: Example word sketch table from the SkELL
 [[BR]]From a syntax theory point of view, word sketches represent an instance of the dependency syntax formalism allowing ambiguity, i.e. forming a dependency multigraph instead of a dependency tree. Similarly to a classical dependency relation, a word sketch relation is a binary relation, which:
-[[BR]]
  * might be reflexive (reflexive relations are called “constructions”)
  * might be symmetric
  * is not transitive
  * is labeled (i.e. named)
  * is scored
 [[BR]]Given that the relation is labeled, items of the relation are usually referred to as word sketch triples consisting of a headword, a relation name and a collocation, such as (resource; modifier; scarce). Such a triple is also assigned a score (a floating point number) representing the strength of the association between headword and collocation where particular semantics depends on the association measure that has been used to compute the score, as explained later.
 …
 The word sketch computation module in Manatee is based on the concept of a word sketch grammar, a plain text file specifying multiple word sketch relations as regular expressions over corpus positions, usually exploiting the morphological annotation of the corpus. Definitions of word sketch relations are standard corpus queries written in the Corpus Query Language (Jakubíček, 2010), which is used in Manatee. An example of a simple word sketch relation definition is given below:
 [[BR]]|| =modifier[[BR]][[BR]] 2:[tag="ADJ"] 1:[tag="NOUN"][[BR]][[BR]] 2:[word=".*ing"] 1:[tag="NOUN"] ||
+|| =modifier[[BR]][[BR]] 2:[tag="ADJ"] 1:[tag="NOUN"][[BR]][[BR]] 2:[word=".*ing"] 1:[tag="NOUN"] ||
 [[BR]]In this example, the modifier relation is defined using two CQL expressions (combined as logical OR) as any adjectives or -ing forms preceding a noun. In the CQL expression, the headword is always prefixed with a “1:” label and the collocation with a “2:” label. In the word sketch computation module, queries for all relations are executed over an already indexed corpus in Manatee, finally yielding the word sketch quintuples which are then subject to indexation.
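
To make the pattern concrete, here is a minimal Python sketch of what the two modifier queries select. It is not how Manatee evaluates CQL; the token list, the tag values and the function name `modifier_triples` are assumptions made only for this illustration.

```python
import re

def modifier_triples(tokens):
    """Return (headword, relation, collocate) triples for an adjective or an
    -ing form standing directly before a noun.

    tokens: list of (word, tag) pairs, a hypothetical stand-in for an indexed corpus.
    """
    triples = []
    # Walk over adjacent token pairs: the first is the candidate collocate ("2:"),
    # the second is the candidate headword ("1:").
    for (w2, t2), (w1, t1) in zip(tokens, tokens[1:]):
        if t1 != "NOUN":
            continue
        # The two CQL lines are alternatives (logical OR).
        if t2 == "ADJ" or re.fullmatch(r".*ing", w2):
            triples.append((w1, "modifier", w2))
    return triples

sentence = [("scarce", "ADJ"), ("resource", "NOUN"), ("and", "CONJ"),
            ("running", "VERB"), ("water", "NOUN")]
print(modifier_triples(sentence))
```

The headword comes first in each triple because it carries the "1:" label, regardless of its position in the text.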
 …
  * STRUCTLIMIT - if set to a corpus structure such as paragraph or sentence, all query matching following this directive throughout the whole grammar will stop at structure boundaries.
  * DEFAULTATTR - specifies default corpus attribute for all following queries in the sketch grammar.
  * DUAL - the name of the relation following this directive is split on a slash character, where first part of the name corresponds to the name of the relation as given, and the second part names a dual relation with headword and collocation interchanged.
  * SYMMETRIC - assumes that the relation following this directive is symmetric. In other words, SYMMETRIC is an instance of DUAL where both relations have the same name.
  * COLLOC - this directive allows forming the collocation string from multiple query parts (labeled with numbers as seen in the example).
 [[BR]]The following example sketch grammar uses all the directives listed above:
 [[BR]]|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]]  2:"ADJ"  "AD[JV]"{0,3} 1:"N"[[BR]][[BR]]  1:"N"  "ADJ"? 2:"ADJ"[[BR]][[BR]]  1:"V.*" 2:"ADV"[[BR]][[BR]]  2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]]  1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]]  1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]]  1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]]  1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]]  1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||
+|| *STRUCTLIMIT s[[BR]][[BR]]*DEFAULTATTR tag[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=modifier/modifies[[BR]][[BR]]  2:"ADJ"  "AD[JV]"{0,3} 1:"N"[[BR]][[BR]]  1:"N"  "ADJ"? 2:"ADJ"[[BR]][[BR]]  1:"V.*" 2:"ADV"[[BR]][[BR]]  2:"ADV" 1:"V.*"[[BR]][[BR]][[BR]]*DUAL[[BR]][[BR]]=object/object_of[[BR]][[BR]]  1:"V.*" "ADJ|ADV|DET"{0,3} 2:"N"[[BR]][[BR]][[BR]]*SYMMETRIC[[BR]][[BR]]=and_or[[BR]][[BR]]  1:"N" "ADJ.?"{0,3} [lemma="et|ou"] "DET.*"? "AD[JV]"{0,3} 2:"N"[[BR]][[BR]]  1:"V.*" [lemma="et|ou"] [lemma="de|être|avoir"|tag="ADV"]? 2:"V.*"[[BR]][[BR]]  1:"ADJ" "NOM" 2:"ADJ"[[BR]][[BR]][[BR]]=prep_phrase[[BR]][[BR]]*COLLOC "%(3.word)_%(2.word)-p"[[BR]][[BR]]  1:[tag="N.."|tag="AJ."] 3:"PR." 2:"N.." ||
 [[BR]][[BR]]Full documentation of the sketch grammar formalism is available online.
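
The effect of the DUAL and SYMMETRIC directives on the generated triples can be sketched in a few lines of Python. This only illustrates the naming rule described above, under the assumption that query matches are already available as (headword, collocate) pairs; `expand_relation` is a hypothetical helper, not part of Manatee.

```python
def expand_relation(name, pairs, directive=None):
    """Return triples for (headword, collocate) pairs under an optional directive."""
    if directive == "DUAL":
        # The relation name is split on the slash: forward name / dual name.
        forward, backward = name.split("/", 1)
    elif directive == "SYMMETRIC":
        # A symmetric relation is a dual relation whose two names coincide.
        forward = backward = name
    else:
        forward, backward = name, None

    triples = []
    for head, colloc in pairs:
        triples.append((head, forward, colloc))
        if backward is not None:
            # Dual relation: headword and collocation interchanged.
            triples.append((colloc, backward, head))
    return triples

print(expand_relation("modifier/modifies", [("resource", "scarce")], "DUAL"))
print(expand_relation("and_or", [("cat", "dog")], "SYMMETRIC"))
```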
 …
 [[BR]]So far sketch grammars for 26 languages have been devised during Sketch Engine development.
-[[BR]]
 === Alternative Approaches to Word Sketch Computation ===
 As we announced earlier, the word sketch generation module in Manatee can be easily replaced by any other tool for generating dependency triples. In particular, one can easily exploit any existing dependency parser for generating word sketches, as we have shown in (Kilgarriff, 2014). Moreover, since word sketches do not require unambiguous representation of a text/sentence, they can be generated also from dependency graphs as was demonstrated in (Horák, 2009).
 …
 The word sketch indexing takes word sketch quintuples as input and stores them in an efficient indexing structure which is part of Manatee and is designed to enable fast execution of the following queries:
-[[BR]]
  * list of all headwords in the database
  * for a given headword, what are the associated relations
  * for a given headword and relation, what are the associated collocations
  * for a given headword, relation and collocation, what are the corpus occurrences (positions) of headword and collocation
  * for a given headword, relation and collocation, what is the number of corpus occurrences of headword and collocation
  * for a given headword, relation and collocation, what is the associated score
-[[BR]]
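
The queries listed above map naturally onto a nested lookup structure. The following in-memory Python sketch shows that shape only; the actual on-disk index format in Manatee is not described here, and the class name, method names and the example score are all made up for the illustration.

```python
class SketchIndex:
    """Toy index: headword -> relation -> collocation -> positions and score."""

    def __init__(self):
        self._db = {}

    def add(self, head, rel, colloc, positions, score):
        entry = {"positions": list(positions), "score": score}
        self._db.setdefault(head, {}).setdefault(rel, {})[colloc] = entry

    def headwords(self):
        return sorted(self._db)

    def relations(self, head):
        return sorted(self._db.get(head, {}))

    def collocations(self, head, rel):
        return sorted(self._db.get(head, {}).get(rel, {}))

    def positions(self, head, rel, colloc):
        return self._db[head][rel][colloc]["positions"]

    def frequency(self, head, rel, colloc):
        return len(self.positions(head, rel, colloc))

    def score(self, head, rel, colloc):
        return self._db[head][rel][colloc]["score"]

idx = SketchIndex()
# Positions are (headword position, collocation position) pairs; the score is invented.
idx.add("resource", "modifier", "scarce", [(30, 27), (76, 80)], 9.5)
print(idx.headwords(), idx.relations("resource"), idx.frequency("resource", "modifier", "scarce"))
```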
 == Word Sketch Scoring ==
 On top of an indexed word sketch database, the statistical components of Manatee finally compute the score for each word sketch triple. A number of lexicographic association measures are implemented in the scoring module (such as T-score, MI-score etc.), the default being a modification of the Dice coefficient called the logDice association score (Rychlý, 2008), defined for a given headword w1, relation R and collocation w2 as:
 …
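
For orientation, the general form of logDice published in Rychlý (2008) is 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency and f_x, f_y are the frequencies of the two items. The Python sketch below computes that general form. Which marginal counts Manatee uses for a triple (w1, R, w2) is not stated here, so the argument names are assumptions.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """General logDice: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence count of the two items (must be positive)
    f_x, f_y: counts of each item on its own (assumed marginals)
    """
    if f_xy <= 0:
        raise ValueError("logDice is undefined for a zero co-occurrence count")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Two items that always occur together reach the theoretical maximum of 14.
print(log_dice(10, 10, 10))
```

Because the score depends only on relative frequencies, it does not grow with corpus size, which makes values comparable across corpora.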
 The word sketch machinery described above is part of the Manatee libraries written in C++. From a user perspective, each of the modules is controlled by a single binary executable run from the Linux command line:
-[[BR]]
  * word sketch generation: genws
  * word sketch indexing: mkwmap
  * word sketch scoring: mkwmrank
 [[BR]]In the following we describe the usage and interaction between these three tools.
 …
 genws takes four parameters:
-[[BR]]
  * the name of the corpus (indexed in Manatee databases)
  * the name of the corpus positional attribute which will be used for the lookup of headwords and collocates (usually a lemma, word form, or combination of lemma and part-of-speech abbreviation called a lempos)
  * the output data path
  * the word sketch grammar file (having .wsdef extension by convention)
 [[BR]]The tool sequentially processes the given word sketch grammar file, evaluates all queries and sends word sketch quintuples to its standard output as binary 64-bit integers. Its output can be inspected with standard Unix tools such as od:
 [[BR]]genws bnc lemma /out/wsdb english.wsdef | od -t d8 -w40
 [[BR]]0000000     24            0           27           27           30
 0000050     27            0           24           30           27
 0000120     41            0           42           49           51
 0000170     42            0           41           51           49
 0000240     36            0           57           76           80
+[[BR]]0000000     24            0           27           27           30
+0000050     27            0           24           30           27
+0000120     41            0           42           49           51
+0000170     42            0           41           51           49
+0000240     36            0           57           76           80
 ![...]
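
The same stream can also be decoded programmatically. The Python sketch below assumes only what the od call above shows: records of five signed 64-bit integers (40 bytes, matching `-w40`) in the machine's native byte order. The sample bytes are packed from the first two rows of the listing rather than read from genws, and no meaning is assigned to the individual fields.

```python
import struct

# Five signed 64-bit integers in native byte order, no padding: 40 bytes per record.
RECORD = struct.Struct("=5q")

def decode_quintuples(data):
    """Split a byte string into 5-integer records, ignoring a trailing partial record."""
    usable = len(data) - len(data) % RECORD.size
    return list(RECORD.iter_unpack(data[:usable]))

sample = RECORD.pack(24, 0, 27, 27, 30) + RECORD.pack(27, 0, 24, 30, 27)
for rec in decode_quintuples(sample):
    print(*rec)
```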
 …
 The output of genws is normally directly passed to mkwmap which reads the word sketch quintuples from standard input and proceeds in three phases:
-[[BR]]
 . sorting of the word sketch quintuples (numerically by headword, relation, and collocation) in a number of batches so that they fit into the computer memory
 . joining of sorted batches
 . writing the indices on disk
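
The first two phases follow the classic external sorting pattern: sort memory-sized batches, then merge them. A minimal Python sketch of that pattern is given below. Here in-memory lists stand in for the on-disk batches, the function names are hypothetical, and the assumption that the first three integers of a quintuple are headword, relation and collocation is made for the illustration only.

```python
import heapq
from itertools import islice

def sort_key(rec):
    # Assumed layout: the first three fields are headword, relation, collocation.
    return rec[:3]

def sorted_batches(records, batch_size):
    """Phase 1: cut the input into batches of at most batch_size and sort each."""
    it = iter(records)
    while True:
        batch = sorted(islice(it, batch_size), key=sort_key)
        if not batch:
            return
        yield batch

def batch_sort(records, batch_size):
    """Phase 2: k-way merge of the sorted batches into one sorted sequence."""
    batches = list(sorted_batches(records, batch_size))
    return list(heapq.merge(*batches, key=sort_key))

rows = [(42, 0, 41, 51, 49), (24, 0, 27, 27, 30), (41, 0, 42, 49, 51),
        (36, 0, 57, 76, 80), (27, 0, 24, 30, 27)]
for rec in batch_sort(rows, batch_size=2):
    print(*rec)
```

The merge only ever needs the current head of each batch in memory, which is what keeps the memory footprint bounded by the batch size.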
 [[BR]]mkwmap takes a single obligatory argument pointing to the same data path as was used in genws. An example call would be:
 …
 [[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because the underlying string-to-number mapping needs to be consistent across a single corpus, and hence shared during parallel processing. As such, it easily becomes a bottleneck severely limiting the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism, called corpus virtualization, which deals with this issue.
 Figure 3: Shared lexicon as a parallelization bottleneck
+Figure 3: Shared lexicon as a parallelization bottleneck
 === Space complexity ===
 In 2015 a major overhaul of the indexing format was performed in order to improve space efficiency. The following table compares the disk space occupied by indices in the two most recent word sketch formats:
 [[BR]]|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
+|| corpus || word sketch format 3[[BR]](since 2011) || word sketch format 4[[BR]](since 2015) ||
 || enTenTen13[[BR]](~22 billion tokens, Jakubíček et al., 2013) || ~370 GB || ~90 GB ||
-[[BR]]
 == Qualitative Evaluation ==
 Besides the technical aspects of word sketch processing, a very important research question is that of their quality. Since 2012 this has been subject to extensive research and is reported as a separate deliverable in D4.1: Methodology of Sketch Grammar evaluation.
 …
 Originally, word sketch headwords were always single-word units, depending on the particular tokenization of the underlying corpus. In (Kilgarriff, 2012) the word sketch functionality was extended to multiword units. The approach taken is scalable in the sense that the multiword unit length is not limited and only depends on the corpus evidence that can be found.
 [[BR]]Technically, multiword sketches are “filtered sketches”, e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and filters the word sketches only for corpus positions where ''young'' is a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate.
+[[BR]]Technically, multiword sketches are “filtered sketches”, e.g. for a noun phrase ''young man'', one retrieves word sketches for ''man'', looks up ''young'' among its collocates, and filters the word sketches only for corpus positions where ''young'' is a collocate of ''man'' (in any relation). The same process is repeated starting from ''young'' with ''man'' as its collocate.
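
The filtering step can be sketched in Python as follows. This is an illustration of the idea rather than the Manatee implementation: the record layout (headword, relation, collocate, headword position, collocate position), the sample data and the function name `multiword_sketch` are assumptions.

```python
def multiword_sketch(records, head, modifier):
    """Keep only the records of `head` at positions where `modifier` is its collocate."""
    # Positions of the headword at which the modifier co-occurs, in any relation.
    anchor_positions = {
        head_pos
        for (h, _rel, c, head_pos, _colloc_pos) in records
        if h == head and c == modifier
    }
    # Filter the headword's sketch down to those positions; the modifier itself
    # is dropped because it is now part of the multiword unit.
    return [
        rec for rec in records
        if rec[0] == head and rec[3] in anchor_positions and rec[2] != modifier
    ]

records = [
    ("man", "modifier", "young", 10, 9),
    ("man", "subject_of", "walk", 10, 11),
    ("man", "modifier", "old", 20, 19),
    ("man", "subject_of", "sit", 20, 21),
]
print(multiword_sketch(records, "man", "young"))
```

Running the same function with the roles of the two words exchanged corresponds to the second pass described above.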
 [[BR]]The resulting multiword sketch is depicted in Figure 4. Note that this process can be repeated so as to retrieve a word sketch for e.g. ''handsome young man''.
 …
 Another extension to the word sketch functionality relates to the application of word sketches on parallel corpora and bilingual texts in general, e.g. to perform a contrastive collocational analysis of a word and its translation equivalent. This can be done in “three flavours”, all of which are described in detail in (Baisa, 2014):
-[[BR]]
 . Bilingual parallel word sketch (BIP)
 In case of two parallel corpora, i.e. corpora with existing alignment on sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on such alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages for individual collocations. An example parallel word sketch is provided in Figure 5.
 [[BR]][[BR]]Figure 5: Bilingual parallel word sketch
-[[BR]]
 . Bilingual comparable word sketch (BIC)
 In the case of two comparable corpora, i.e. corpora which are not parallel but are roughly of the same size and the same text types, we can leverage the translation candidates computed for the same pair of languages on two parallel corpora as in the first case, and proceed similarly, except that we obviously cannot show any aligned segments.
-[[BR]]
 . Bilingual manual word sketch (BIM)
 In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.
 [[BR]]Figure 4: Multiword sketch for ''young man''
-[[BR]]
 === Longest-commonest match ===
 In some cases it is not very clear how the collocation is used with the headword. In simple cases where the collocation is an adjective and the headword is a noun, the whole phrase is immediately clear (''young man''), however in many others it is much more difficult: consider headword ''resource'' and collocation ''management'' in the relation ''modifies''. Reviewing all the concordance lines actually reveals that the most common (hence longest-commonest) match is ''human resources management''.