Hacker News new | ask | show | jobs
by vcmiraldo 1807 days ago
That is a really difficult problem for more reasons than what fits in this comment :) In fact, I got my PhD studying this very problem (https://victorcmiraldo.github.io/data/MiraldoPhD.pdf).

I did not find any description of how your diffing algorithm works nor how you represent a patch. I'd be really curious to know more.

2 comments

Wow, thank you for the pointer! I've added it to https://github.com/Wilfred/difftastic/wiki/Structural-Diffs as I'm trying to understand the other solutions in this space.

Difftastic does not create a patch or worry about merging. That's a hard problem that I'm not trying to solve. Instead, it builds two ASTs, then marks each node as unchanged or novel.

To compute the diff, I use a graph search. Each vertex represents a position in both the left and right ASTs.

Suppose you're comparing A with X A.

Start node:

  Left: A   Right: X A
        ^          ^
The possible next nodes are:

(1) Treat A on the left as novel.

  Left: A   Right: X A
         ^         ^
(2) Treat X on the right as novel.

  Left: A   Right: X A
        ^            ^
Both (1) and (2) are the same 'distance', but (2) is closer to the end node, because there's a edge from (2) to the end that marks A as unchanged.

I've implemented this using Dijkstra's algorithm. My graph is directed and acyclic, so there are faster algorithms like topological sort. However, I don't construct the whole graph in advance (that would take O(N^2) memory) so instead I construct the graph nodes as necessary.

(This is very similar to Autochrome, which I've linked in the README. Autochrome has a worked example which is really helpful.)

At some point I think I'll have to use A* search instead. If there are more than 500 lines of code with lots of changes, difftastic can take a few seconds to terminate due to the naive graph search.

Thanks for the reply Wilfred! I was not familiar with Autochrome, I will certainly have a look!

That's interesting, I like the idea of not worrying about patching nor merging, giving you a tool that is focused on "communicating the differences to a human", and indeed, it means you don't have to worry about a whole bag of problems.

One insight that I came across (more info on Chap 5 of my thesis) is that not considering or handling duplication means you incur a quadratic slowdown in your search algorithm. For example, say you're diffing `A` against `Bin A A`. If you can't understand that `A` was duplicated, which `A` do you copy? You have to evaluate both options even though it really doesn't matter which one you copy.

One good middle ground for speeding up your algorithm while not having to worry about displaying duplications is to have an intermediate step where first you diff with duplication detection, but then you just go over the result and make arbitrary choices about which duplicate to copy and which to insert/delete.

I think the diffing is the “obvious” graph search algorithm between trees, where a “tree” is a list of atoms or trees (think lisp lists).

Basically to diff a tree of n top-level elements against one of m elements, construct a graph where nodes lie on an (n+1)x(m+1) grid. Each node (a,b) corresponds to having looked at a elements of the first and matched them to b elements of the second list. Add edges (a,b)->(a+1,b) for deletion; (a,b)->(a,b+1) for insertion; and (a,b)->(a+1,b+1) for an inner diff (ie basically this graph search problem again). Choose weights to apply to node and now find the shortest path from (0,0) to (n,m).

From you description it seems like we're just computing the standard insert-delete tree-edit-distance. These tend to be slow.

This implies that the patch language only supports insertion, deletion and modification of nodes, which is a shame since refactorings, moves and duplications are also common operations in the source-code domain. Additionally, if the patch language only supports insertion, deletion and modification, the merging algorithm will perform poorly.

Yep, that's a fair description. Note that I'm not providing a merge algorithm, just a pretty way of viewing changes.

I did look at modelling moves in an earlier prototype, but it's incredibly hard to display the result in a coherent way when there are also insertions. It was also easier to drop it when I moved to Dijkstra.

As you can see in the screenshot in the readme, it does support inserting tree nodes whilst preserving children, which covers a ton of cases.