Hacker News new | ask | show | jobs
by UhUhUhUh 3367 days ago
I like the reference to a metric, which I always felt could be a useful approach to many problems, including in AI. Re spell-checking, I wonder whether there is an algorithm that uses topography (i.e. vicinity of keys) to assist with the correction...
1 comments

There is ... a crude spell checker could be implemented by dumping a dictionary of words into a metric tree and using levenshtein distance (or damerau-levenshtein distance) as the distance function. The spell check would then consist of doing a K nearest neighbor lookup.

But it doesn't work as well as you expect. A metric tree requires a true metric to be used as the distance function, which means the distance function must satisfy both symmetry and triangle inequality. Levenshtein distance satisfies this - but unfortunately this usually isn't enough for a robust spell-checker that just "does the right thing". For that, you usually need something like a weighted levenshtein distance (e.g., an incorrect first character should weigh more than an incorrect Nth character, or an incorrect consonant should weigh more than an incorrect vowel). But it is much harder to guarantee that such a distance function is a true metric.

Well for a spell checker its not too unreasonable to require the triangle inequality, if there's some series of possible typos to get from word A to word B then the distance between A and B shouldn't be more than the total 'edit distance' of the typos, however you measure it.

Making things symmetrical might not make as much sense though, omitting a letter might be far more likely than adding an additional letter, for instance. Assuming it satisfies all other axioms you should be able to turn this into a true metric, by just measuring the distance both ways and taking the average.

Then again it looks like the BK-tree might not need the symmetry axiom, at least not in an obvious way.

Guess I'll ask here, any non-brute force techniques to speed searching with such a metric? (specifically one that violates triangle inequality)
> an incorrect first character should weigh more than an incorrect Nth character

It's possible that this is true for how most people type, but not for me.