| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Wilfred 1807 days ago

Wow, thank you for the pointer! I've added it to https://github.com/Wilfred/difftastic/wiki/Structural-Diffs as I'm trying to understand the other solutions in this space.

Difftastic does not create a patch or worry about merging. That's a hard problem that I'm not trying to solve. Instead, it builds two ASTs, then marks each node as unchanged or novel.

To compute the diff, I use a graph search. Each vertex represents a position in both the left and right ASTs.

Suppose you're comparing A with X A.

Start node:

  Left: A   Right: X A
        ^          ^

The possible next nodes are:

(1) Treat A on the left as novel.

  Left: A   Right: X A
         ^         ^

(2) Treat X on the right as novel.

  Left: A   Right: X A
        ^            ^

Both (1) and (2) are the same 'distance', but (2) is closer to the end node, because there's a edge from (2) to the end that marks A as unchanged.

I've implemented this using Dijkstra's algorithm. My graph is directed and acyclic, so there are faster algorithms like topological sort. However, I don't construct the whole graph in advance (that would take O(N^2) memory) so instead I construct the graph nodes as necessary.

(This is very similar to Autochrome, which I've linked in the README. Autochrome has a worked example which is really helpful.)

At some point I think I'll have to use A* search instead. If there are more than 500 lines of code with lots of changes, difftastic can take a few seconds to terminate due to the naive graph search.

1 comments

vcmiraldo 1807 days ago

Thanks for the reply Wilfred! I was not familiar with Autochrome, I will certainly have a look!

That's interesting, I like the idea of not worrying about patching nor merging, giving you a tool that is focused on "communicating the differences to a human", and indeed, it means you don't have to worry about a whole bag of problems.

One insight that I came across (more info on Chap 5 of my thesis) is that not considering or handling duplication means you incur a quadratic slowdown in your search algorithm. For example, say you're diffing `A` against `Bin A A`. If you can't understand that `A` was duplicated, which `A` do you copy? You have to evaluate both options even though it really doesn't matter which one you copy.

One good middle ground for speeding up your algorithm while not having to worry about displaying duplications is to have an intermediate step where first you diff with duplication detection, but then you just go over the result and make arbitrary choices about which duplicate to copy and which to insert/delete.