It was for the general case. The reason that I was able to do this is because I was less persistent than them. I tried to understand the paper but after 5 minutes I realized that even understanding the paper was going to take WAY more time than ignoring the paper and implementing it my own way. Here's how I did it.
If we are looking for
string = "banana"
Then we can represent the state of the automaton as the last row of the matrix that you get when you compute the Levenshtein distance between two fixed strings. The initial state is (in Python):
def init():
return range(len(string)+1)
Then to take a step in the automaton:
def step(state, char):
newstate = [0 for x in state]
newstate[0] = state[0]+1
for i in range(len(state)-1):
if i < len(string) and string[i] == char:
newstate[i+1] = state[i]
else:
newstate[i+1] = 1 + min(newstate[i],state[i],state[i+1])
return newstate
Now we can compute the lowerbound of the Levenshtein distance by doing min(s6). In this case it's 2. This means that whatever comes after "cabana", it will always have at least distance 2 to "banana". With this info we can prune away a search path in the full text index if that value is larger than our maximum edit distance.
Those handful of lines of code is all you need to do fuzzy string search in practice. This represents the automaton as a step procedure. If you want you can also generate a DFA from this (though it's probably not necessary in practice). If your maximum edit distance is n then if one of the numbers in the state is greater than n it doesn't matter what it is. In the above example s6 = [6, 5, 4, 3, 2, 3, 2]. If n = 3 then s6 = [4, 4, 4, 3, 2, 3, 2] is equivalent, because in the end it only matters whether a number is >3 or not. So you might as well keep the numbers on 4. Replace:
where n is the maximum edit distance. Now the state space of the automaton is finite, and you can generate a DFA from it by just exploring all the states with the step() function. One more optimization is to not generate the DFA for the full alphabet. If your search word is "banana" then for the purposes of the automaton the letter 'x' is equivalent to the letter 'y' because both are not equal to any letter in "banana". So instead of creating a DFA for the full ASCII alphabet (or worse, the full unicode alphabet), you can instead work with the reduced 4 letter alphabet (b,a,n,X). X represents any letter other than b,a,n.
You could also do a hybrid where you generate the DFA lazily.
I don't know if that made sense, it's a bit difficult to explain in a short HN comment.
I'm not sure if I can follow - I'll give it some more thought and time. That said: You're doing something completely different as far I can tell. You build an ~automaton~ based on an input word. That's not what the paper does/what I struggled with. The paper describes a general automaton and creating a 'vector' based on the input word, that you use as steps.
At the moment I don't see how you could handle transpositions either.
I'm not saying that your approach is bad. But I do think that the 'I did it in an hour' comment was a quite a bit misleading, if you basically ignored the paper and did something that is different in most ways.
The tradeoffs are immensely different - the whole point of the paper is that you're precomputing a looot of stuff so that the lookup is fast.
Computing the step() is extremely fast. And if that's not fast enough then do the DFA construction as I described. The approach is different, yes, but that's the point.
I commented on your step function in another comment so I'm going to skip that.
Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.
You implemented an automaton that computes Levenshtein distances. However, Levenshtein automata are quite different from what you describe. Your automaton executes the basic Wagner-Fischer / Needleman-Wunsch / ... algorithm.
This is not correct. The end result of the step() based automaton is the same: it prunes exactly the same search paths as any Levenshtein automaton would. And the part where I described how to build the DFA gives you exactly a Levenshtein automaton DFA. The approach is different, yes, that's the point: it's much simpler and still does the job.
It's not nearly as efficient though, your step function requires O(len(string)) time no matter how well you prune. Since you have len(query) many steps, that gets you to O(len(string)*len(query)), aka quadratic time if they're roughly the same length. Levenshtein automata can do this in linear time because they spend time building the automaton first (preprocessing). So yes, you implemented an algorithm using automata that computes the same result. But you didn't implement Levenshtein automata.
In practice with Lucene, len(string) and len(query) are like 10. So it's totally irrelevant. Furthermore computing the step is extremely fast: you're just doing a handful of min()'s. Even a single cache miss is going to completely dominate that, let alone a disk seek. What matters is that you don't scan a 10 million word index, instead you want to prune your search to hit, say, 50 words in the index.
That's just the step() approach. After that I described how to build the DFA, which gets you the same optimal linear time that you want which does not depend on the query string size.
Reply to other comment:
> Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.
Why is it incomplete? You just follow the step() automaton and construct a DFA that does the same. Every time you hit a new state you create a new node in the DFA, and if you hit an old state you just point to a previously constructed DFA node. You can even do the DFA construction lazily.
> But you didn't implement Levenshtein automata.
A Levenshtein automaton for a word W is a finite state automaton that accepts all words V within edit distance n from W. That's exactly what I built. The point here is that you turn a 30 second search over the whole index into a 0.02 second search by pruning. If you can then optimize that to 0.015 by making the automaton more efficient that's nice but you can hardly claim that what I did is not a Levenshtein automaton because it's a bit slower (and it's not even clear to me that it would be slower).
You can argue about the lengths of these strings and the simplicity of your step function all day long, at the end of the day it's still θ(n²) and no better than the simple dynamic programming approach. In fact, even for that it's a relatively bad implementation because it constructs a new list in every step, whereas you could just reuse the same two lists (called score in your implementation) all the time.
An L1 cache miss, L2 hit costs around 10 cycles, and the L2 cache is more than sufficiently large. Even your initialisation loop takes more than that. The min loop has 11 instructions (branchless) per iteration for the minimum calculations in C++: https://goo.gl/wjhRtb - not taking into account superscalarity.
You have not shown how you prune the search, so I can't say anything about that. Of course that's the entire point of having an index.
Your DFA construction again is massively slower than what is done in the paper. The authors show how to construct the entire automaton in time O(|string|), whereas each step in your implementation takes that much time.
Whether or not you built Levenshtein automata is a pointless discussion. You say you built a DFA for Levenshtein distance (true). I'm saying that you didn't implement the paper. Both are correct.
Look, I'm not claiming you did anything wrong. I'm just pointing out that your implementation, while it was fast to write, is also much much slower than their algorithm, and you shouldn't compare the two as if they were the same.
If we are looking for
Then we can represent the state of the automaton as the last row of the matrix that you get when you compute the Levenshtein distance between two fixed strings. The initial state is (in Python): Then to take a step in the automaton: We step like this: Now we can compute the lowerbound of the Levenshtein distance by doing min(s6). In this case it's 2. This means that whatever comes after "cabana", it will always have at least distance 2 to "banana". With this info we can prune away a search path in the full text index if that value is larger than our maximum edit distance.Those handful of lines of code is all you need to do fuzzy string search in practice. This represents the automaton as a step procedure. If you want you can also generate a DFA from this (though it's probably not necessary in practice). If your maximum edit distance is n then if one of the numbers in the state is greater than n it doesn't matter what it is. In the above example s6 = [6, 5, 4, 3, 2, 3, 2]. If n = 3 then s6 = [4, 4, 4, 3, 2, 3, 2] is equivalent, because in the end it only matters whether a number is >3 or not. So you might as well keep the numbers on 4. Replace:
with: where n is the maximum edit distance. Now the state space of the automaton is finite, and you can generate a DFA from it by just exploring all the states with the step() function. One more optimization is to not generate the DFA for the full alphabet. If your search word is "banana" then for the purposes of the automaton the letter 'x' is equivalent to the letter 'y' because both are not equal to any letter in "banana". So instead of creating a DFA for the full ASCII alphabet (or worse, the full unicode alphabet), you can instead work with the reduced 4 letter alphabet (b,a,n,X). X represents any letter other than b,a,n.You could also do a hybrid where you generate the DFA lazily.
I don't know if that made sense, it's a bit difficult to explain in a short HN comment.