| I think it’s ended up with a quadratic algorithm for diffing sequences and a quadratic log algorithm for diffing dictionaries. To understand why the sequences problem is quadratic, consider a sequence A of length m being doffed with a sequence B of length n. We want to express our diff in the minimum number of operations where an operation is removing, adding, or editing an element in the sequence. Construct a graph as follows: the nodes will be the points on an mxn lattice corresponding to points in the two sequences. An edge going right means “delete this item from sequence A,” and costs (eg 1). An edge going down means “add this item from sequence B” and has a similar cost. An edge going diagonally down and right means to edit the item in A into the item in B and it’s cost depends on how different they are. The problem is to find the shortest path from the top left to the bottom right. If you could compute the entire graph for free and then applied something like Dijkstra’s algorithm you would be worst-case quadratic (if all the diagonal costs were 2 or more, you would need to touch every node). There are a few ways you could try to improve this: 1. Look for easy opportunities to optimise. Eg you could have a patience style strategy of cutting off any common prefix or suffix. This won’t help in the worst case. 2. Limit to a fixed width diagonal. This might mean worse diffs but means the graph search problem becomes more linear. I suspect something is going on with the diagonal based on the description 3. Somehow develop some good heuristics and use a better search algorithm like A*. This might not help in the worst case 4. Something else. |