| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by acidbaseextract 1935 days ago

It's cool, but does seem quite slow. I'm diffing two 45kB CSVs on a fast computer and after 10 minutes I'm still at:

  Diffing:   0% ... 0/93195 [00:00<?, ?it/s]
  Tightening Fringe Diagonal 48 of 792: ...

4 comments

dan-robertson 1935 days ago

I think it’s ended up with a quadratic algorithm for diffing sequences and a quadratic log algorithm for diffing dictionaries.

To understand why the sequences problem is quadratic, consider a sequence A of length m being doffed with a sequence B of length n. We want to express our diff in the minimum number of operations where an operation is removing, adding, or editing an element in the sequence. Construct a graph as follows: the nodes will be the points on an mxn lattice corresponding to points in the two sequences. An edge going right means “delete this item from sequence A,” and costs (eg 1). An edge going down means “add this item from sequence B” and has a similar cost. An edge going diagonally down and right means to edit the item in A into the item in B and it’s cost depends on how different they are. The problem is to find the shortest path from the top left to the bottom right.

If you could compute the entire graph for free and then applied something like Dijkstra’s algorithm you would be worst-case quadratic (if all the diagonal costs were 2 or more, you would need to touch every node).

There are a few ways you could try to improve this:

1. Look for easy opportunities to optimise. Eg you could have a patience style strategy of cutting off any common prefix or suffix. This won’t help in the worst case.

2. Limit to a fixed width diagonal. This might mean worse diffs but means the graph search problem becomes more linear. I suspect something is going on with the diagonal based on the description

3. Somehow develop some good heuristics and use a better search algorithm like A*. This might not help in the worst case

4. Something else.

link

meetups323 1935 days ago

Quadratic doesn't need to equal "bad", especially in this case. Two 45 kB items is 2 billion entries. Allocating 2 billion bytes is easy enough. Iterating over 2 billion bytes is also not terrible. The GP says the process is estimated to take 150 hours, or half a million seconds, or 1.62e15 cycles... so around 1 million cycles per cell.

link

dan-robertson 1935 days ago

If it’s doing anything nontrivial (eg computing the weights of the diagonal edges by comparing the rows as sequences) then you’re basically screwed. The problem with quadratic is that it doesn’t scale but it’s fast enough for small inputs that it is hard to notice until you get a large input.

link

meetups323 1935 days ago

My point is that even though it's quadratic it can still be fast for the inputs mentioned (dozens of kilobytes), so long as the constant is low. If you have a quadratic algorithm that takes 1 cycle per byte of input squared(or less, using wide registers), it will be pretty damn quick for most inputs. If you have a quadratic algorithm that takes 1 million cycles for each byte of input squared (such as this one), that's a whole different story. The time to process 1 Megabyte in the 1 cycle algorithm would only let you process a kilobyte in the new.

Point is that things like being efficient with memory access and using sufficiently low level (or JIT'ed) languages can get you very far, and it's not really meaningful to dismiss an algorithm solely based on it being quadratic.

link

acidbaseextract 1935 days ago

30 minutes in and I still don't have a diff of the two CSVs... Has this program worked for anyone here?

    Diffing:   0%|                                | 153/93195 [30:02<151:31:17,  5.86s/it]
    Tightening Fringe Diagonal 82 of 792:  48%|...| 7453/15514 [00:30<00:35, 226.08it/s]

link

dj_mc_merlin 1935 days ago

Have you tried -l or -k?

link

acidbaseextract 1935 days ago

I don't currently have access to the machine and files that I was doing it on, but I'll try those in the morning!

link

IshKebab 1935 days ago

Tree diffing algorithms have very bad complexity (e.g. O(N^4)!) so this will probably only work on really small examples.

link

rmccue 1935 days ago

I have a use case for diffing trees, so would love to know of any optimal algorithms you may know of; I'm operating generally with less than 100 nodes, so it's not a huge concern, but I'm finding that discovering _any_ algorithms for this has been tough.

link

IshKebab 1935 days ago

Yeah I found the same. It seems that there hasn't really been much research in this area, and somewhat annoyingly there isn't a widely agreed term for the problem, though for some reason most of the algorithms are described in terms of diffing XML so if you search for "XML difference" you can find some papers.

There's a few algorithms like XDiff, XyDiff, XChange etc. but be prepared to find very old code on sourceforge or more likely no code at all.

I couldn't find anything with a decent complexity that either had code or was simple/well described enough that I could implement it so I gave up.

link