| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by markpapadakis 2382 days ago

SymSpell is great for tiny datasets(based on edit distance/operations, and dictionary size). It’s fast, but the generated index will likely be very large; furthermore, you can’t practically encode rules and transition costs (e.g cost from “f” to “ph” being zero for getting better results for [ifone]).

An FST is the better choice. We (https://www.bestprice.gr/) used SymSpell in the past, but have since switched to an FST based design, where the FST is used to match prefixes. We also store all queries encoded in the FST, together with their frequency, sorted by the query.

For each distinct corrected prefix, we determine the range in that (query, frequency) list using binary search, and then just iterate in that list from [first, last] in that range, and compute a score based on a language model and the edit cost(prefix to corrected prefix) (i.e noisy channel).

We encode many different rules in the FST including language domain ones, and building the index takes a few seconds for millions of queries. The language model is based on 5-grams and Stupid Backoff.

2 comments

wolfgarbe 2382 days ago

[Disclaimer: I'm the author of SymSpell] >>"SymSpell is great for tiny datasets". Dictionaries with one million word entries are no problem, even for a maximum edit distance of 4. For comparison, the Oxford English Dictionary contains 301,100 main entries and 616,500 word-forms in total. https://towardsdatascience.com/symspell-vs-bk-tree-100x-fast...

SymSpell can be augmented with a Weighted Edit distance giving a higher priority to pairs that are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). There are two SymSpell implementations with a Weighted Edit distance available: https://github.com/MighTguY/customized-symspell https://github.com/searchhub/preDict

link

markpapadakis 2382 days ago

For a few million queries we needed to index, many of them close to 30 characters in length(some even longer than that), the generated index size for max edit distance 3, was really large.

So we used 3 indices (one for unigrams, bigrams, and trigrams respectively) -- and during query processing we would segment the input query and for each segment we ‘d consult any of those 3 indices that made sense and would keep the top-K ones, and then we ‘d proceed to the next segment and we ‘d consider the n-gram sequence of the “suffix” and “prefix” matches between the “carried” top-K from the previous segment and the current/next segment, and so on, until we have exhausted the query. Segmentation was particularly hard to get right.

It was a fairly involved process but it was what worked for us -- again, SymSpell is _very_ fast for short tokens, but we had to execute maybe 1000s of such SymSpell queries when processing an input query and it adds up quickly. (We will probably open-source our SymSpell implementation soon).

link

ssubu 2382 days ago

[Disclaimer: work at Cliqz] Both techniques have their share of upsides and downsides, infact we also use an FST based model to perform splits i.e donaldtrump ---> donald trump. The problem with the FST approach is when the prefix has a typo. I tried the spell correct on the website you linked to, it seems to fail when the first character is incorrect. Thanks for the information though, it is always interesting to read and learn from other approaches :)

link

markpapadakis 2382 days ago

Could you please tell me what you entered?

As for prefixes and typos, in our case, the FST is used for both matching a prefix and spelling correction of the prefix.

The first character is treated specially; the chance of getting the first character wrong is very low(according to our findings) so only completions/corrections that score very high get to be matched.

link

ssubu 2382 days ago

Sure. cmputer gets corrected, not omputer.

Just to provide some additional information, our library Keyvi(https://github.com/KeyviDev/Keyvi) has a very fast implementation of an FST based spell correct.

link

markpapadakis 2382 days ago

Got it. This is by design - though I suspect we need to dial down the cost for misspelled first character in prefixes.

We built our own FST, which is somewhat similar to the design of the Rust fst crate ( https://blog.burntsushi.net/transducers/#fst-construction ).

link