Changes between Version 5 and Version 6 of WordSketchGrammars


Timestamp:
Jan 17, 2016, 10:25:59 PM (8 years ago)
Author:
xkocinc
Comment:

--

  • WordSketchGrammars

    v5 v6  
    168168[[BR]]The most time-consuming part is the word sketch computation, which depends on the number of relations and the complexity of the corpus queries. The indexation phase is not very CPU intensive: it is log-linear with regard to the number of quintuples because of the sorting, and its speed depends mainly on how fast the underlying storage performs I/O operations. The scoring phase is very fast since it makes use of existing indices, and it usually takes less than 0.5% of the overall computation time.
    169169
    170 [[BR]]Figure 2: Trivial parallelization
     170[[BR]]
     171[[Image(fig2.png)]]
     172Figure 2: Trivial parallelization
    171173
    172174[[BR]]It follows that any effort to speed up processing should focus on the computation, i.e. query evaluation, phase. In (Pomikálek, 2012) we showed a parallelization approach which, depending on the structure of the word sketch relation, allows a close-to-linear speedup of this phase with regard to the number of processing cores used.
    173175
    174176[[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because the underlying string-to-number mapping needs to be consistent across a single corpus and hence must be shared during parallel processing. As such, it easily becomes a bottleneck that severely limits the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism, called corpus virtualization, which deals with this issue.
     177
     178[[Image(fig3.png)]]
    175179
    176180Figure 3: Shared lexicon as a parallelization bottleneck
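
The contrast between trivial parallelization and the shared-lexicon constraint can be illustrated with a minimal Python sketch. It is not the implementation used by the corpus manager: the helper names (`split`, `build_lexicon`, `parallel_counts`) and the toy token list are assumptions made for illustration, and threads are used only to show the structure of the computation, not to demonstrate a real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(tokens, n):
    """Split a token list into n contiguous parts of near-equal size."""
    size, rest = divmod(len(tokens), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rest else 0)
        parts.append(tokens[start:end])
        start = end
    return parts


def build_lexicon(tokens):
    """Hypothetical string-to-number mapping: IDs in order of first occurrence."""
    lexicon = {}
    for tok in tokens:
        lexicon.setdefault(tok, len(lexicon))
    return lexicon


def parallel_counts(tokens, n):
    """Trivial parallelization: count each part independently, then merge."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(Counter, split(tokens, n)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


tokens = "the young man saw the old man".split()

# Counting works on strings, so the N tasks are fully independent.
counts = parallel_counts(tokens, 2)

# Building a lexicon per part does not work: the same ID ends up
# denoting different strings in different parts.
local = [build_lexicon(part) for part in split(tokens, 2)]

# A single lexicon for the whole corpus is consistent, but it has to be
# shared by all tasks, which is where the bottleneck arises.
shared = build_lexicon(tokens)
```

In this toy run, `local[0]["young"]` and `local[1]["old"]` receive the same ID, whereas the shared lexicon keeps them distinct, which is why the mapping cannot simply be built per part.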
     
    202206In the case of two parallel corpora, i.e. corpora with an existing alignment at the sentence or paragraph level, we have devised algorithms for the automatic computation of translation candidates based on that alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments illustrating the usage of individual collocations. An example parallel word sketch is provided in Figure 5.
    203207
    204 [[BR]][[BR]]Figure 5: Bilingual parallel word sketch
     208[[BR]][[BR]]
     209[[Image(fig5.png)]]
     210Figure 5: Bilingual parallel word sketch
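
The text does not say which algorithm is used to compute translation candidates. As a hedged illustration only, the following Python sketch ranks target words by the Dice coefficient of their co-occurrence with a source word in aligned segments. The function name `translation_candidates`, the choice of Dice scoring, and the toy English-German data are all assumptions, not the method described in the source.

```python
from collections import Counter


def translation_candidates(aligned_pairs, source_word):
    """Rank target words by Dice score against source_word.

    aligned_pairs is a list of (source_tokens, target_tokens) tuples,
    one per aligned sentence or paragraph. Dice scoring is an assumed
    stand-in for whatever measure the real system uses.
    """
    src_freq = Counter()   # number of segments containing each source word
    tgt_freq = Counter()   # number of segments containing each target word
    pair_freq = Counter()  # segments where target word co-occurs with source_word
    for src_seg, tgt_seg in aligned_pairs:
        src_types, tgt_types = set(src_seg), set(tgt_seg)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        if source_word in src_types:
            pair_freq.update(tgt_types)
    scores = {
        tgt: 2 * joint / (src_freq[source_word] + tgt_freq[tgt])
        for tgt, joint in pair_freq.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


pairs = [
    ("the young man".split(), "der junge Mann".split()),
    ("the old man".split(), "der alte Mann".split()),
    ("the young woman".split(), "die junge Frau".split()),
]
ranked = translation_candidates(pairs, "young")
top_word, top_score = ranked[0]
```

On this toy data the top candidate for ''young'' is ''junge''; a word sketch for that top candidate would then be shown next to the source-language sketch.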
    205211
    206212 2. Bilingual comparable word sketch (BIC)
     
    212218In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.
    213219
    214 [[BR]]Figure 4: Multiword sketch for ''young man''
     220[[BR]]
     221[[Image(fig4.png)]]
     222Figure 4: Multiword sketch for ''young man''
    215223
    216224=== Longest-commonest match ===