| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by markpapadakis 2389 days ago

For a few million queries we needed to index, many of them close to 30 characters in length(some even longer than that), the generated index size for max edit distance 3, was really large.

So we used 3 indices (one for unigrams, bigrams, and trigrams respectively) -- and during query processing we would segment the input query and for each segment we ‘d consult any of those 3 indices that made sense and would keep the top-K ones, and then we ‘d proceed to the next segment and we ‘d consider the n-gram sequence of the “suffix” and “prefix” matches between the “carried” top-K from the previous segment and the current/next segment, and so on, until we have exhausted the query. Segmentation was particularly hard to get right.

It was a fairly involved process but it was what worked for us -- again, SymSpell is _very_ fast for short tokens, but we had to execute maybe 1000s of such SymSpell queries when processing an input query and it adds up quickly. (We will probably open-source our SymSpell implementation soon).