|
|
|
|
|
by wolfgarbe
2382 days ago
|
|
[Disclaimer: I'm the author of SymSpell]
>>"SymSpell is great for tiny datasets".
Dictionaries with one million word entries are no problem, even for a maximum edit distance of 4. For comparison, the Oxford English Dictionary contains 301,100 main entries and 616,500 word-forms in total.
https://towardsdatascience.com/symspell-vs-bk-tree-100x-fast... SymSpell can be augmented with a Weighted Edit distance giving a higher priority to pairs that are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound).
There are two SymSpell implementations with a Weighted Edit distance available:
https://github.com/MighTguY/customized-symspell
https://github.com/searchhub/preDict |
|
So we used 3 indices (one for unigrams, bigrams, and trigrams respectively) -- and during query processing we would segment the input query and for each segment we ‘d consult any of those 3 indices that made sense and would keep the top-K ones, and then we ‘d proceed to the next segment and we ‘d consider the n-gram sequence of the “suffix” and “prefix” matches between the “carried” top-K from the previous segment and the current/next segment, and so on, until we have exhausted the query. Segmentation was particularly hard to get right.
It was a fairly involved process but it was what worked for us -- again, SymSpell is _very_ fast for short tokens, but we had to execute maybe 1000s of such SymSpell queries when processing an input query and it adds up quickly. (We will probably open-source our SymSpell implementation soon).