| I've had to do this a few times when building simple input normalization features. The trouble with scoring based on Levenshtein distance in that case is that it improperly penalizes phrases that are significantly different in length but similar in content. For example, let's say I was searching a database of countries for "North Korea". In my list I have: South Africa (LD: 6)
Congo (LD: 9)
Republic of Korea (LD: 11)
...
Democratic People's Republic of Korea (LD: 28) The actual answer (DPRK) will be pushed far to the bottom based on a naive ranking that uses LD. The hack that's worked best for me? Rank based on the number of common two-character substrings between the source and target. It's simple, easy to build an index for, and has surprisingly great results. Its ideal use case is if you don't need to return a single absolute best result and can, say, present the three best to a human being and let them pick the match. For the above search I'd get the following results using the two-character method: Republic of Korea: 4
DPRK: 4
South Africa: 1
Congo: 0 |
Knowing which algorithm to use for a problem is most important; sometimes padded hamming will do, other times minhash or bitap or dice's coefficient etc are good. Each algorithm is a kind of balance between a metric and priorities - deciding what is subjectively most important in this problem. Your method for example is not as robust as Damerau Levenshtein when it comes to analyzing DNA sequences. As far as the best default string comparison, I have found dice coefficient to be ideal (your method appears to be based on similar ideas but my hunch is it's less robust).
For your example Dice's Coefficient does as well as can be expected without a concept of countries:
and by blind luck ("ort Korea"), longest common sub-sequence does even better: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_dis...http://en.wikipedia.org/wiki/Dice_coefficient