Hacker News new | ask | show | jobs
by mmarx 4031 days ago
> diff(1) shows "deletions, insertions, and substitutions".

diff(1) doesn't give you a _minimal_ set of edits to apply to go from one file to the other, just _a_ set of edits.

1 comments

Also, I think he's picturing two almost-equal files. In that case the average running time should be way lower, no? (I believe the quadratic time is worst case)
> two almost-equal files. In that case the average running time should be way lower, no?

Yes, indeed! What you're thinking of is "output-sensitive" algorithms. There are some output-sensitive algorithms for edit distance. The fastest one I found is in "Improved Algorithms for Approximate String Matching" by Dimitris and Georgios Papamichail. They note:

"We designed an output sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance between the two strings."

http://arxiv.org/abs/0807.4368

Cool. The naive approach yields O(exp(s)*n), this looks like a nice improvement.
Yes, but complexity theorists are mostly interested in worst case analysis I believe (which I think is rather unfortunate) -- so the quoted numbers are probably what you get from plugging in 100,000^2 * dt or so.
It's also got a lower quadratic bound, it has to fill in the whole matrix of m*n cells, at least for the simple implementation. There may well be some optimisation for almost identical strings.
1,000 years? Really?

Isn't it more likely that somebody misquoted "slow, like a half an hour" as "slow, like a THOUSAND YEARS"?

  >>> 1000 / ((1024. ** 6) / 3e9 / 86400 / 365)
  82.0593593076069
The algorithm is quadratic in the input size. For a Gigabyte of data, that's 1024^6 operations. Dividing that by 3 * 10^9 operations/second (assuming a 3GHz CPU), 86400 (the number of seconds in a day), and 365 (the number of days in a year), we obtain the runtime (in years) assuming that comparing a single byte takes exactly one operation. Dividing 1000 by that number, we get ~82 operations to compare a single byte, and that doesn't look unreasonable.
They're quoting exponential (2^N), not quadratic (N^2) time.

If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 100 quintillion years.

That is yet another section of the article, the thousand years clearly reference the edit distance, which is quadratic.