| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wood_spirit 700 days ago
	iirc the go-to data structure for this kind of problem is called a “vantage point tree” (vp tree). Another approach that I haven’t explored is to prep the array of words to search into a tree. normal words in Latin languages have lots of common prefixes and suffixes so you can dramatically reduce the amount of nodes (see my own old blog which compresses a scrabble dictionary https://williame.github.io/post/87682811573.html - same prefix suffix sharing our work on non-scrabble rotations too). Now walk the tree doing Levenshtein but checking multiple words at once?

1 comments

jbaiter 700 days ago

Not for strings, no. The default algorithm would be Levenshtein edit distance and friends (i.e. what this library uses). If you want to get fancy, you could even go for something like Levenshtein automata (which are used by Lucene to implement fuzzy search across terms).

link

myaccountonhn 700 days ago

In my case I needed to match multiple names (consisting of 2-5 words), I had better matches and performance using the bag of words model (https://en.wikipedia.org/wiki/Bag-of-words_model).

link