Changes between Version 5 and Version 6 of WordSketchGrammars
- Timestamp: Jan 17, 2016, 10:25:59 PM
WordSketchGrammars
@@ -168,9 +168,13 @@
 [[BR]]The most time-consuming part is word sketch computation, which depends on the number of relations and the complexity of the corpus queries. The indexation phase is not very CPU-intensive: it is log-linear in the number of quintuples because of the sorting, and its speed depends mainly on the speed of the underlying storage performing I/O operations. The scoring phase is very fast since it uses existing indices, and usually takes less than 0.5% of the overall computation time.
 
-[[BR]]Figure 2: Trivial parallelization
+[[BR]]
+[[Image(fig2.png)]]
+Figure 2: Trivial parallelization
 
 [[BR]]It follows that any effort toward speeding up the computation should be devoted to the computation phase, i.e. query evaluation. In (Pomikálek, 2012) we have shown a parallelization approach which, depending on the structure of the word sketch relation, allows a close to linear speedup of this phase with regard to the number of processing cores used.
 
 [[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because the underlying string-to-number mapping needs to be consistent across a single corpus, and hence shared during parallel processing. As such, it easily becomes a bottleneck that severely limits the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism which deals with this issue, called corpus virtualization.
+
+[[Image(fig3.png)]]
 
 Figure 3: Shared lexicon as a parallelization bottleneck
@@ -202,5 +206,7 @@
 In the case of two parallel corpora, i.e. corpora with an existing alignment on the sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on that alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages for individual collocations. An example parallel word sketch is provided in Figure 5.
 
-[[BR]][[BR]]Figure 5: Bilingual parallel word sketch
+[[BR]][[BR]]
+[[Image(fig5.png)]]
+Figure 5: Bilingual parallel word sketch
 
 2. Bilingual comparable word sketch (BIC)
@@ -212,5 +218,7 @@
 In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation.
 
-[[BR]]Figure 4: Multiword sketch for ''young man''
+[[BR]]
+[[Image(fig4.png)]]
+Figure 4: Multiword sketch for ''young man''
 
 === Longest-commonest match ===