|
It was for the general case. The reason that I was able to do this is because I was less persistent than them. I tried to understand the paper but after 5 minutes I realized that even understanding the paper was going to take WAY more time than ignoring the paper and implementing it my own way. Here's how I did it. If we are looking for string = "banana"
Then we can represent the state of the automaton as the last row of the matrix that you get when you compute the Levenshtein distance between two fixed strings. The initial state is (in Python): def init():
return range(len(string)+1)
Then to take a step in the automaton: def step(state, char):
newstate = [0 for x in state]
newstate[0] = state[0]+1
for i in range(len(state)-1):
if i < len(string) and string[i] == char:
newstate[i+1] = state[i]
else:
newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])
return newstate
We step like this: s0 = init()
s1 = step(s0, 'c')
s2 = step(s1, 'a')
s3 = step(s2, 'b')
s4 = step(s3, 'a')
s5 = step(s4, 'n')
s6 = step(s5, 'a')
Now we can compute the lowerbound of the Levenshtein distance by doing min(s6). In this case it's 2. This means that whatever comes after "cabana", it will always have at least distance 2 to "banana". With this info we can prune away a search path in the full text index if that value is larger than our maximum edit distance.Those handful of lines of code is all you need to do fuzzy string search in practice. This represents the automaton as a step procedure. If you want you can also generate a DFA from this (though it's probably not necessary in practice). If your maximum edit distance is n then if one of the numbers in the state is greater than n it doesn't matter what it is. In the above example s6 = [6, 5, 4, 3, 2, 3, 2]. If n = 3 then s6 = [4, 4, 4, 3, 2, 3, 2] is equivalent, because in the end it only matters whether a number is >3 or not. So you might as well keep the numbers on 4. Replace: newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])
with: newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1],n+1)
where n is the maximum edit distance. Now the state space of the automaton is finite, and you can generate a DFA from it by just exploring all the states with the step() function. One more optimization is to not generate the DFA for the full alphabet. If your search word is "banana" then for the purposes of the automaton the letter 'x' is equivalent to the letter 'y' because both are not equal to any letter in "banana". So instead of creating a DFA for the full ASCII alphabet (or worse, the full unicode alphabet), you can instead work with the reduced 4 letter alphabet (b,a,n,X). X represents any letter other than b,a,n.You could also do a hybrid where you generate the DFA lazily. I don't know if that made sense, it's a bit difficult to explain in a short HN comment. |
At the moment I don't see how you could handle transpositions either.
I'm not saying that your approach is bad. But I do think that the 'I did it in an hour' comment was a quite a bit misleading, if you basically ignored the paper and did something that is different in most ways.
The tradeoffs are immensely different - the whole point of the paper is that you're precomputing a looot of stuff so that the lookup is fast.