Hacker News new | ask | show | jobs
by MtotheThird 5158 days ago
I've had to do this a few times when building simple input normalization features. The trouble with scoring based on Levenshtein distance in that case is that it improperly penalizes phrases that are significantly different in length but similar in content.

For example, let's say I was searching a database of countries for "North Korea". In my list I have:

South Africa (LD: 6) Congo (LD: 9) Republic of Korea (LD: 11) ... Democratic People's Republic of Korea (LD: 28)

The actual answer (DPRK) will be pushed far to the bottom based on a naive ranking that uses LD.

The hack that's worked best for me? Rank based on the number of common two-character substrings between the source and target. It's simple, easy to build an index for, and has surprisingly great results. Its ideal use case is if you don't need to return a single absolute best result and can, say, present the three best to a human being and let them pick the match.

For the above search I'd get the following results using the two-character method:

Republic of Korea: 4 DPRK: 4 South Africa: 1 Congo: 0

7 comments

The problem is that what you call the actual answer - DPRK - is subjective. By picking DPRK you need a more intelligent algorithm than fuzzy string matching because you are looking for the right answer in terms of semantic knowledge of countries, not blind character comparisons. Fundamentally, the fact that DPRK is the right answer is an accident of History and English, there really is no right answer and no algorithm can capture this insight without being told. By biasing in one direction you are sacrificing performance in some other unknown. The trick is in balancing the bias with general ability. As always.

Knowing which algorithm to use for a problem is most important; sometimes padded hamming will do, other times minhash or bitap or dice's coefficient etc are good. Each algorithm is a kind of balance between a metric and priorities - deciding what is subjectively most important in this problem. Your method for example is not as robust as Damerau Levenshtein when it comes to analyzing DNA sequences. As far as the best default string comparison, I have found dice coefficient to be ideal (your method appears to be based on similar ideas but my hunch is it's less robust).

For your example Dice's Coefficient does as well as can be expected without a concept of countries:

  ("dice", "South Africa", 0.125); 
  ("dice", "Congo", 0.0);
  ("dice", "Republic of Korea", 0.44);
  ("dice", "Democratic People's Republic of Korea", 0.24)]
and by blind luck ("ort Korea"), longest common sub-sequence does even better:

  [("lcs", "South Africa", 6.0); 
   ("lcs", "Congo", 2.0);
   ("lcs", "Republic of Korea", 7.0);
   ("lcs", "Democratic People's Republic of Korea", 9.0)] 
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_dis...

http://en.wikipedia.org/wiki/Dice_coefficient

Oh, I'm under no illusions that it's a universally applicable string matcher. What it seems to be useful for is normalizing human input in situations where someone can pick the best match. It handles typical normalization transforms well (word fragments, transpositions) and is very easy to index.

In my example, DPRK is the objective best answer .. There's pretty clearly a correct match in the value domain, the challenge is to help the user find it as quickly as possible.

Use the Jaccard over a sweeping window or something.

over 5 letter windows:

  TEST:   A 0.4
  TEST:   B 0.540540540541
  TEST:   C 0.692307692308
over words:

  TEST:   A 0.555555555556
  TEST:   B 0.714285714286
  TEST:   C 1.0
Here's the Korea one over a 5 letter window:

  north korea:congo 0.0
  north korea:democratic people's republic of korea 0.0526315789474
  north korea:republic of korea 0.111111111111
  north korea:south africa 0.0
I had to do fuzzy string matching for car brand names, Two character pair matching tended to work best.

Here is a gist of the code in ruby if anybody wants to steal it. I stole it from somewhere else and thus you have the history of programming.

https://gist.github.com/1038659

What kind of thought process and/or experimentation led you to that algorithm?
The original code links to:

http://stackoverflow.com/questions/653157/a-better-similarit...

where there is more discussion, and links out to different string similarity metrics

I struggled to make LD work well, then someone smarter than me suggested the two character method. :)
That's a cool idea.

I wonder how the weights would turn out if your algo were added as a term in his formula and run through the optimization, or even applied to his sample data standalone.

If you're dealing with people's names, there's also the soundex algorithm.

https://en.wikipedia.org/wiki/Soundex

Very interesting. You should put this on the SO question.