| HN Mirror


by lorenzhs 4068 days ago

Unless you are the mythical 100x programmer, I doubt that you wrote a full implementation of general Levenshtein automata in an hour. I read the paper that introduced them ( http://link.springer.com/article/10.1007/s10032-002-0082-8 ) and they are quite the complex beast. Not to mention that the paper is very technical and you need to keep a dozen definitions in your head.

That said, there seems to be a fairly readable implementation at https://github.com/universal-automata/liblevenshtein

I'm currently working on implementing fast Levenshtein queries in C++ with a friend, and we intend to implement the paper I linked in my original post. So far, our dynamic programming Levenshtein already beats Lucene++ (C++ implementation of Lucene), which is a bit embarrassing [1]. If you're interested, more advanced stuff will hit https://github.com/xhochy/libfuzzymatch when we get around to implementing it.

[1] Lucene++ spends more time converting strings between UTF-8 and UTF-32 than it does computing Levenshtein distances, says the profiler.
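For readers who haven't seen it, the dynamic programming Levenshtein mentioned above can be sketched as follows. This is not libfuzzymatch code (that project is C++); it is a minimal Python illustration of the textbook recurrence, and the function name `levenshtein` is just a placeholder.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the rows stay short
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]
```

It runs in O(|a|·|b|) time and O(min(|a|, |b|)) space, which is the baseline the fancier approaches in this thread try to beat.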

2 comments

jules 4068 days ago

I'm not a 100x programmer, I just did a couple of things that drastically reduced the time:

1. I didn't follow that paper. Even trying to understand that paper would have taken way more time, so after 5 minutes of trying to understand it I gave up on that approach. See this comment for what I did do: https://news.ycombinator.com/item?id=9699870 . That saved maybe 20x.

2. I used Python instead of C++ or Java. This saved 5x.

3. The code was throwaway quality code. This saved 2x.

Together that's 200x, but I'm at least a 2x worse programmer than them, so that gives you the 100x ;-)


lorenzhs 4068 days ago

(see my other comment as well)

An algorithmicist would say that all this saved you a constant factor of work for a linear slowdown ;)


jules 4068 days ago

That's a nice soundbite but it's not correct. The worst-case performance with the DFA is linear, the same as theirs.


lorenzhs 4068 days ago

No, that's just not true. Your step function takes time linear in the length of the string. For example, `newstate = [0 for x in state]` takes Θ(|state|) time, and because you initialise the state with `range(len(string)+1)`, that's linear in the string length.
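To make the cost concrete, here is a minimal sketch of what such a step function plausibly looks like. It is a reconstruction from the two snippets quoted above, not the actual code from the linked comment; the names `start`, `step` and `distance_via_steps` are assumptions.

```python
def start(string: str) -> list[int]:
    # One entry per prefix of `string`, so building it is Θ(len(string)).
    return list(range(len(string) + 1))

def step(string: str, state: list[int], c: str) -> list[int]:
    # Consumes one query character and builds a whole new DP row,
    # so every call costs Θ(len(state)) = Θ(len(string)).
    new_state = [state[0] + 1]
    for i in range(len(string)):
        cost = 0 if string[i] == c else 1
        new_state.append(min(
            new_state[i] + 1,   # skip string[i]
            state[i] + cost,    # match or substitute
            state[i + 1] + 1,   # skip c
        ))
    return new_state

def distance_via_steps(string: str, query: str) -> int:
    state = start(string)
    for c in query:
        state = step(string, state, c)
    return state[-1]
```

Feeding a query through `step` one character at a time is just the usual DP table computed row by row, which is where the per-step linear factor comes from.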


jules 4067 days ago

Now you're talking about the cost of constructing the DFA, not searching the index with the resulting DFA. The cost of constructing the DFA is irrelevant, and even then you can construct the DFA in O(n) with my method for fixed max edit distance and fixed alphabet. Same as that paper.
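The distinction between building the DFA and running it can be illustrated with a small sketch. Assumptions: `build_dfa` and `accepts` are hypothetical names, DP rows are capped at `k + 1` so that only finitely many states exist, and the construction is a plain breadth-first exploration. It is not the method from the linked comment or from the paper, and it makes no attempt to reach the O(n) construction bound.

```python
from collections import deque

def build_dfa(string, k, alphabet):
    """Enumerate all reachable DP rows (entries capped at k + 1) as DFA states."""
    cap = k + 1

    def step(state, c):
        new = [min(state[0] + 1, cap)]
        for i, ch in enumerate(string):
            new.append(min(new[i] + 1, state[i] + (ch != c), state[i + 1] + 1, cap))
        return tuple(new)

    start = tuple(min(i, cap) for i in range(len(string) + 1))
    transitions = {}  # (state, char) -> next state
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for c in alphabet:
            nxt = step(state, c)
            transitions[(state, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    accepting = {s for s in seen if s[-1] <= k}
    return start, transitions, accepting

def accepts(dfa, word):
    """Run the prebuilt DFA: one dictionary lookup per input character."""
    state, transitions, accepting = dfa
    for c in word:
        state = transitions[(state, c)]
    return state in accepting
```

Once the table exists, `accepts` does constant work per character of `word`, independent of the length of `string`. Note that `word` may only contain characters from `alphabet` in this sketch.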


jamra 4068 days ago

I'd like to implement the same paper. Perhaps I'm missing something, but I'm not sure how the residual strings are created. Do you have a link to an implementation or a description of the residual strings?

I get that a residual string is the original string with a deletion, incrementing the deletions until you hit edit distance d. What I'm not sure about is if it's all permutations of possible deletions.


lorenzhs 4068 days ago

The residual strings are all subwords obtained by deleting exactly d letters. For d=1 and the word "Levenshtein", that would be {"evenshtein", "Lvenshtein", "Leenshtein", "Levnshtein", "Leveshtein", "Levenhtein", "Levenstein", "Levenshein", "Levenshtin", "Levenshten", "Levenshtei"}.

The paper does not specify how to generate those efficiently, and I haven't given it any thought yet. I don't know of any implementations of the paper, but this aspect of it should be common enough.

EDIT: sorry, didn't read your comment fully. I'm not sure what you mean by "all permutations of possible deletions". The d-deletion neighbourhood of w contains all subwords of w that you obtain by deleting any d letters from w. For d=2, take any two letters and remove them. N₂(jamra) = {jam,jar,jaa,jmr,jma,jra,amr,ama,ara,mra} (hope I didn't forget any...)
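A brute-force way to enumerate such a neighbourhood, as a sketch only: pick every set of positions to keep and join the surviving letters. The name `deletion_neighbourhood` is made up for illustration, and this is not the efficient generation the paper leaves open; it simply walks all C(|w|, d) choices and lets a set remove duplicates.

```python
from itertools import combinations

def deletion_neighbourhood(word: str, d: int) -> set[str]:
    """All distinct strings obtained by deleting exactly d letters from word."""
    n = len(word)
    if d > n:
        return set()  # cannot delete more letters than the word has
    result = set()
    for kept in combinations(range(n), n - d):
        # `kept` is an increasing tuple of positions, so letter order is preserved
        result.add("".join(word[i] for i in kept))
    return result
```

Calling `deletion_neighbourhood("jamra", 2)` or `deletion_neighbourhood("Levenshtein", 1)` is an easy way to check a hand-written list for omissions.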

Does that make it clearer?


jamra 4068 days ago

Yes, that makes it much clearer. I also found a FastSS implementation, which uses the same d-deletion neighborhood. Here it is: http://fastss.csg.uzh.ch

I am looking at a Python implementation for examples.
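Roughly, the deletion-neighbourhood idea turns into an index like the sketch below. This is not the code from the FastSS site; `variants`, `build_index` and `query` are hypothetical names, and the sketch stores every variant with up to k deletions in a plain dictionary, with no attention paid to memory use. Two words within edit distance k always share at least one such variant, but sharing a variant does not guarantee distance k or less, so candidates are verified with a real edit distance at the end.

```python
from collections import defaultdict
from itertools import combinations

def variants(word, k):
    """The word itself plus every string reachable by deleting up to k letters."""
    out = set()
    for d in range(min(k, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - d):
            out.add("".join(word[i] for i in kept))
    return out

def edit_distance(a, b):
    """Plain row-by-row Levenshtein, used only to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(words, k):
    """Map each deletion variant to the set of dictionary words producing it."""
    index = defaultdict(set)
    for w in words:
        for v in variants(w, k):
            index[v].add(w)
    return index

def query(index, q, k):
    """Dictionary words within edit distance k of q."""
    candidates = set()
    for v in variants(q, k):
        candidates |= index.get(v, set())
    # Filter out false positives from the candidate set.
    return {w for w in candidates if edit_distance(q, w) <= k}
```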


lorenzhs 4068 days ago

Nice, that seems to be based on a similar idea to the paper I mentioned (but earlier and less refined).


jamra 4067 days ago

The paper you mentioned reduces memory consumption hugely and averages out the query and insertion time. It's a good improvement.
