| HN Mirror



	by schlupa 1835 days ago
	That's similar to what I do on our translation memory at the Commission. The issue we have is that we search for sentences, not words: the average sentence length in the database is around 120 characters, and we have around 1.5 billion sentences. A pairwise Levenshtein would be completely impossible, especially since we also have to take care of replaceables in the segments (dates, numbers, months, weekdays, etc.). To accelerate the search we use fuzzy keys, which are based on trigram counts and organized in a ternary tree. For the fuzzy distance, a simple difference calculation between two fuzzy keys is close enough to the Levenshtein distance that we don't need fancier metrics (for short sentences it is relatively bad, but short sentences are mostly irrelevant for translation memories). Our fuzzy index reduces the search space for the Levenshtein distance calculation by 4 to 5 orders of magnitude (a sentence search is done on a space of 100K-300K sentences; after filtering, the number of candidates rarely goes beyond 100).
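
For readers who want to see the shape of the idea, here is a minimal sketch of a trigram-count prefilter followed by exact Levenshtein ranking. It is not the Commission's implementation: the names `trigram_key`, `key_difference`, `search` and the threshold `max_key_diff` are made up for illustration, the key is a plain `Counter` rather than whatever fixed layout the real fuzzy keys use, candidates are found by a linear scan instead of a ternary tree, and replaceables (dates, numbers, etc.) are not handled at all.

```python
from collections import Counter


def trigram_key(sentence: str) -> Counter:
    """Hypothetical fuzzy key: counts of character trigrams after light normalization."""
    text = " ".join(sentence.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def key_difference(a: Counter, b: Counter) -> int:
    """Cheap fuzzy distance: sum of absolute differences between trigram counts."""
    return sum(abs(a[g] - b[g]) for g in a.keys() | b.keys())


def levenshtein(s: str, t: str) -> int:
    """Classic edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]


def search(query, sentences, max_key_diff=15, limit=5):
    """Filter by key difference first, then rank the survivors by Levenshtein distance.

    max_key_diff is an arbitrary illustrative threshold, not a tuned value.
    A real index would precompute the keys instead of rebuilding them per query.
    """
    qkey = trigram_key(query)
    candidates = [s for s in sentences
                  if key_difference(qkey, trigram_key(s)) <= max_key_diff]
    return sorted(candidates, key=lambda s: levenshtein(query, s))[:limit]
```

The point of the two stages is that the key difference is cheap to compute and can be indexed, so the expensive edit distance only runs on the few candidates that survive the filter.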


CornCobs 1835 days ago

Interesting! What I noticed when approaching the problem was that there is very little information on scaling up. I also don't think there are good out-of-the-box solutions covering a wide range of use cases. Dedup (basically a cross-product) and linkage (highly dependent on the relative sizes of your search set and backing data) call for very different optimizations when your data is large.
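
As a rough illustration of why the two cases scale differently, the sketch below counts the brute-force comparisons each would need before any blocking or indexing. The function names are hypothetical, and the counts assume that dedup compares every unordered pair within one data set while linkage compares every query record against every backing record.

```python
def dedup_pairs(n: int) -> int:
    """Unordered pairs within a single data set of n records."""
    return n * (n - 1) // 2


def linkage_pairs(queries: int, backing: int) -> int:
    """Comparisons when each query record is checked against every backing record."""
    return queries * backing
```

Dedup grows quadratically in a single size, whereas linkage depends on the product of two sizes, so a small search set against a large backing store is a very different problem from two sets of similar size.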
