by wolfgarbe
136 days ago
Peter Norvig shows that a maximum edit distance of 2 covers 98.9% of spelling errors.
https://impythonist.wordpress.com/2014/03/18/peter-norvigs-2... That is why SymSpell's default maximum edit distance is 2. All 6 of your 6 examples are chosen from the 1.1% margin that edit distance 2 does not cover, and each presents an unusually high number of errors within a single word. The third-party SymSpell port from Justin Wilaby, which you used for benchmarking, clearly states that you need to set both maxEditDistance and dictionaryEditDistance to a higher number if you want to correct higher edit distances. You neither did that nor mentioned it. This has nothing to do with accuracy; it is a tradeoff between performance and maximum edit distance that one can make according to the use case at hand. https://github.com/justinwilaby/spellchecker-wasm?tab=readme... pronnouncaition IS within edit distance 3, according to the Damerau-Levenshtein edit distance used by SymSpell. The reason is that adjacent transpositions are counted as a single edit.
https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
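To make the transposition point concrete, here is a minimal Python sketch. It is not SymSpell's code: `edit_distance` is a hypothetical helper that implements the restricted (optimal string alignment) variant of Damerau-Levenshtein when `transpositions=True` and plain Levenshtein otherwise. I am assuming "pronunciation" is the intended correction.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Edit distance between a and b using a rolling-row dynamic program.

    With transpositions=True, swapping two adjacent characters costs 1
    (optimal string alignment variant of Damerau-Levenshtein).
    With transpositions=False, this is plain Levenshtein distance.
    """
    prev2 = None                       # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))     # row 0: distance from "" to b[:j]
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,           # deletion
                cur[j - 1] + 1,        # insertion
                prev[j - 1] + cost,    # substitution or match
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # adjacent transposition counted as a single edit
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


print(edit_distance("pronnouncaition", "pronunciation"))         # 3
print(edit_distance("pronnouncaition", "pronunciation", False))  # 4
```

The two deletions (the extra "n" and the "o") plus the "ai" to "ia" swap give 3 edits when the swap counts once, and 4 when it has to be paid for as two substitutions.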
|
Lexiathan also doesn't have any edit distance parameters that need to be configured, so there is no "tuning" required. In particular, it's worth mentioning that using a very large dictionary, e.g. 500,000 words, often degrades accuracy rather than improves it, and likely increases memory usage and latency as well.
Regarding Norvig's 98.9% figure: this seems to be from Norvig's own made-up data. In the real world, users often generate misspellings that exceed an edit distance of 2 in many use cases (OCR, non-native speakers, medical/technical terminology, etc.), and published text (often already spell-checked) doesn't reflect the same level of errors. And in any case, Norvig's spell-checker apparently only achieves an accuracy of 67% on its own chosen benchmarks, so clearly the 98.9% figure is not a realistic reflection of actual spell-checker performance, even for an edit distance of 2. Lexiathan is extremely accurate and retains high performance even on heavily degraded input, and the benchmark data (and demo) that I presented reflect that.