Hacker News new | ask | show | jobs
by wood_spirit 700 days ago
iirc the go-to data structure for this kind of problem is called a “vantage point tree” (vp tree).

Another approach that I haven’t explored is to prep the array of words to search into a tree. normal words in Latin languages have lots of common prefixes and suffixes so you can dramatically reduce the amount of nodes (see my own old blog which compresses a scrabble dictionary https://williame.github.io/post/87682811573.html - same prefix suffix sharing our work on non-scrabble rotations too). Now walk the tree doing Levenshtein but checking multiple words at once?

1 comments

Not for strings, no. The default algorithm would be Levenshtein edit distance and friends (i.e. what this library uses). If you want to get fancy, you could even go for something like Levenshtein automata (which are used by Lucene to implement fuzzy search across terms).
In my case I needed to match multiple names (consisting of 2-5 words), I had better matches and performance using the bag of words model (https://en.wikipedia.org/wiki/Bag-of-words_model).