Hacker News new | ask | show | jobs
by mc_hammer 4454 days ago
fwiw the search is called fuzzy matching

also, gl on the project

2 comments

Another interesting algorithm for fuzzy searching is this one: http://www.catalysoft.com/articles/StrikeAMatch.html

It essentially compares the number of matching two letter pairs in strings.

I used it in a C# app that did a fuzzy instant search as the user typed more letters in. On a dataset of about six thousand names it performed really well, but it was a pretty simple app. I also used a javascript version on the same list to see how it performed on an X120e netbook, and again, it was instantaneous.

No idea how the performance would compare to the methods for common substring. I like the pairs matching because I can screw up letter positioning and accidentally type a letter or two that doesn't exist in the string, but still get back strongly matching results and usually find what I want. I used it because we had a huge issue at work with people brutalizing names they entered into our employee database.

Here's an implementation I wrote in javascript: https://gist.github.com/doorhammer/3016ecaac5313a87804b

I never used it in production, and it's not optimized really at all, but it shows how straight forward a typical version can be.

I took a quick look. What he described is essentially a specific case of n-grams (n=2, bigrams).

http://en.wikipedia.org/wiki/N-gram

Thanks for the information. So it looks like it's a specific application of an algorithm to vectors of bigrams? The most relevant part of the wikipedia page (I think): http://en.wikipedia.org/wiki/N-gram#n-grams_for_approximate_...

It also appears that the algorithm I linked is actually the Sørensen–Dice index. They have the exact same formula on the wiki page: http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coef...

Appreciate the heads up. Gave me much better terms to search for. Going to add them to the notes of my gist. I'm on vacation now, so I'll have to do more reading on it over the next few days

Also made a public gist for whoever is interested:

https://gist.github.com/doorhammer/9957864

And here's an explanation with some code examples: http://www.quora.com/Algorithms/How-is-the-fuzzy-search-algo...