| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kelseydh 61 days ago
	I needed a fuzzy string matching algorithm for finding best name matches among a candidate list. Considered Normalized Levenshtein Distance but ended up using Jaro-Winkler. I'm curious if anybody has good resources on when to use each fuzzy string matching algorithm and when.

3 comments

vintermann 61 days ago

Levenshtein distance is rarely the similarity measure you need. Words usually mean something, and it's usually the distance in meaning you need.

As usual, examples from my genealogy hobby: many sites allow you to upload your family tree as a gedcom file and compare it to other people's trees or a public tree. Most of these use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No!! These tools are unfortunately useless to me because they give so many false positives.

These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.

link

jppittma 61 days ago

It depends on if or not you're trying to correct for typos, or do something semantic. Also, embedding distance is much much more expensive.

link

RobinL 61 days ago

There's a section in the docs of our FOSS record linkage software that covers this: https://moj-analytical-services.github.io/splink/topic_guide...

link

leeoniya 61 days ago

Levenshtein distance is often a poor way to fuzzy match or rank. i suspect that in js, even the trie approach would incur significant GC/alloc thrashing or cost of building a huge trie index.

i tried fuzzy matching using a cleverly-assembled regexp approach which works surprisingly well: https://github.com/leeoniya/uFuzzy

link

srean 61 days ago

I would argue the opposite, with the 'often' doing some heavy lifting.

It is very likely that you have interacted with a Levenstein distance based spell corrector (with many modifications) and I have touched that code. Used well they can be very powerful.

link