| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lorenzhs 4027 days ago

The residual strings are all subwords where exactly d letters were deleted. For d=1 and the word "Levenshtein", that would be {"evenshtein", "Lvenshtein", "Leenshtein", "Levnshtein", "Leveshtein", "Levenhtein", "Levenstein", "Levenshein", "Levenshtin", "Levenshten", "Levenshtei"}.

The paper does not specify how to generate those efficiently, and I haven't given it any thought yet. I don't know of any implementations of the paper, but this aspect of it should be common enough.

EDIT: sorry, didn't read your comment fully. I'm not sure what you mean with "all permutations of possible deletions". The d-deletion-Neighbourhood of w contains all sub-words of w that you obtain by deleting any d letters from w. For d=2, take any two letters and remove them. N₂(jamra) = {jam,jar,amr,jaa,ama,jmr,jra,ara,mra} (hope I didn't forget any...)

Does that make it clearer?

1 comments

jamra 4027 days ago

Yes that makes it supremely clearer. I also found a FastSS implementation, which uses the same d-deletion neighborhood. Here it is: http://fastss.csg.uzh.ch

I am looking at a python implementation for examples.

link

lorenzhs 4026 days ago

nice, that seems to be based upon a similar idea as the paper I mentioned (but earlier and less refined).

link

jamra 4026 days ago

The paper you mentioned reduces memory consumption hugely and averages out the query and insertion time. It's a good improvement.

link