by schlupa
1835 days ago
That's similar to what I do on our translation memory at the Commission. The issue we have is that we search for sentences, not words: the average sentence in the database is around 120 characters long, and we have around 1.5 billion sentences in the database. A pairwise Levenshtein would be completely impossible, and on top of that we have to take care of replaceables in the segments (dates, numbers, months, weekdays, etc.).
To accelerate the search we use fuzzy keys, which are based on trigram counts, and keep them organized in a ternary tree. For the fuzzy distance, a simple difference between two fuzzy keys is close enough to the Levenshtein distance that we don't need fancier metrics (it is relatively bad for short sentences, but short sentences are mostly irrelevant for translation memories). Our fuzzy index reduces the search space for the Levenshtein distance calculation by 4 to 5 orders of magnitude (a sentence search is done on a space of 100K-300K sentences; after filtering, the number of candidates rarely goes beyond 100).
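To make the idea concrete, here is a rough Python sketch of the two-stage search: a cheap trigram-count filter first, Levenshtein only on the survivors. This is not our production code. The names (`trigram_key`, `key_distance`, `search`), the whitespace/lowercase normalization, the threshold, and the choice of a sum of absolute count differences as the "simple difference" are all assumptions for illustration. The linear scan over all sentences stands in for the ternary tree index, which is not shown, and replaceables are not handled.

```python
from collections import Counter


def trigram_key(sentence: str) -> Counter:
    """Hypothetical fuzzy key: character trigram counts of a normalized sentence."""
    text = " ".join(sentence.lower().split())  # assumed normalization
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def key_distance(a: Counter, b: Counter) -> int:
    """Assumed 'simple difference': sum of absolute trigram count differences."""
    return sum(abs(a[g] - b[g]) for g in a.keys() | b.keys())


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def search(query: str, sentences: list[str],
           max_key_distance: int, top_n: int = 100) -> list[str]:
    """Filter by fuzzy key distance, then rank the survivors by Levenshtein.

    The list comprehension is a linear scan standing in for a real index.
    """
    qkey = trigram_key(query)
    candidates = [s for s in sentences
                  if key_distance(qkey, trigram_key(s)) <= max_key_distance]
    return sorted(candidates, key=lambda s: levenshtein(query, s))[:top_n]
```

In a real system the keys would be computed once at indexing time rather than on every query, and the index would be what avoids touching most of the database at all.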