|
|
|
|
|
by kelseydh
61 days ago
|
|
I needed a fuzzy string matching algorithm for finding best name matches among a candidate list. Considered Normalized Levenshtein Distance but ended up using Jaro-Winkler. I'm curious if anybody has good resources on when to use each fuzzy string matching algorithm and when. |
|
As usual, examples from my genealogy hobby: many sites allow you to upload your family tree as a gedcom file and compare it to other people's trees or a public tree. Most of these use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No!! These tools are unfortunately useless to me because they give so many false positives.
These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.