by todd8 3526 days ago
For name matching there is also a very old algorithm called Soundex (first patented in 1918). It is well described in Knuth's The Art of Computer Programming, Vol. 3, Sorting and Searching. It's a simple algorithm, so the Wikipedia page is enough: https://en.wikipedia.org/wiki/Soundex#cite_ref-10

Soundex was designed for a very specific purpose. It is very culture-dependent and, in my experience, works very poorly in most practical name-matching applications.
Soundex works fine as part of a larger process, especially when combined with other kinds of normalization. You need a human to make the final judgement on matches. In the course of a year I have to match 100k names to names in a database of 850k people. Soundex is great for flagging names that might match, or for flagging matches that might be incorrect. I use Soundex in combination with NYSIIS, double metaphone, lists of commonly confused names, etc. Before I created our current matching process, we were creating approximately 5-10k duplicate records a year.

Quick edit: Our data sources are handwritten and typed names, often transcribed by a second party. So algorithms that detect transposition errors as well as phonetic errors are really helpful.
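
To illustrate what "detecting transposition errors" can look like, here is a minimal sketch of the optimal string alignment distance (a restricted form of Damerau-Levenshtein) in Python. The function name osa_distance is hypothetical, and this is not the matching process described above, only one common way to count a swapped pair of adjacent letters as a single edit.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each as one edit. Illustrative sketch only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "TI" typed as "IT".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this, osa_distance("SMITH", "SMTIH") is 1, whereas plain Levenshtein distance would score the same pair as 2. A small distance between two names that also share a phonetic key is one plausible signal for flagging a pair for human review.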

I've used a Python implementation of soundex() in a production data mining app to help resolve things like ECQUADOR->ECUADOR. Worked well (as an entity resolution mechanism among many others).
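
For readers who haven't seen the algorithm, a minimal sketch of American Soundex in Python follows. It is not the implementation mentioned above, just one way to write the standard rules: keep the first letter, map the remaining consonants to digits, collapse adjacent letters with the same digit (H and W do not break such a run, vowels do), and pad to four characters. Non-letter characters are simply dropped, which is an assumption of this sketch.

```python
# Letter groups and their Soundex digits.
_GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" if no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = [letters[0]]
    prev = _CODES.get(letters[0], "")  # the first letter's code counts too
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result.append(code)
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return "".join(result).ljust(4, "0")
```

Both soundex("ECQUADOR") and soundex("ECUADOR") give "E236", since Q falls in the same group as the C before it and is collapsed.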