|
|
|
|
|
by markpapadakis
2382 days ago
|
|
SymSpell is great for tiny datasets(based on edit distance/operations, and dictionary size).
It’s fast, but the generated index will likely be very large; furthermore, you can’t practically encode rules and transition costs (e.g cost from “f” to “ph” being zero for getting better results for [ifone]). An FST is the better choice. We (https://www.bestprice.gr/) used SymSpell in the past, but have since switched to an FST based design, where the FST is used to match prefixes. We also store all queries encoded in the FST, together with their frequency, sorted by the query. For each distinct corrected prefix, we determine the range in that (query, frequency) list using binary search, and then just iterate in that list from [first, last] in that range, and compute a score based on a language model and the edit cost(prefix to corrected prefix) (i.e noisy channel). We encode many different rules in the FST including language domain ones, and building the index takes a few seconds for millions of queries. The language model is based on 5-grams and Stupid Backoff. |
|
SymSpell can be augmented with a Weighted Edit distance giving a higher priority to pairs that are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound). There are two SymSpell implementations with a Weighted Edit distance available: https://github.com/MighTguY/customized-symspell https://github.com/searchhub/preDict