by DannyBee 4701 days ago
Remember that diff is an algorithm to generate the smallest set of operations to produce version B from version A, not an accurate reconstruction of what happened. Diff algorithms are also often tuned not to try as hard to find the smallest set of changes for larger documents, due to speed concerns.

Git's built-in diff algorithm is particularly bad for prose. Since it's aimed at line-oriented code, it does line-based diffs, which is horrible for reflowed ASCII text, because a small change can make every line in a paragraph show up as changed.

Example: https://github.com/divegeek/uscode/commit/1fb2d83137dad1c6ca...

What's happened is that "Section 2" was moved later in the sentence, abbreviated as "Sec. 2", "of" was deleted, and "act" was capitalized:

    Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided...

    Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided...
The rest of the paragraph is unchanged, but git shows a 6-line diff with the entire paragraph replaced. GitHub attempts to do some word-based highlighting (see the timestamp lines), but it falls down on most of these paragraphs. Wikipedia's diffing tends to work better for this kind of thing; I'm not sure what they use. The upshot is that the number of lines changed may be a 5-10x overestimate.
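To make the overestimate concrete, here is a minimal Python sketch using the standard library's `difflib`, which is not git's diff code. The sample paragraph and the helper names `changed_lines` and `changed_words` are made up for illustration. The second version inserts a single word and is then re-wrapped by hand.

```python
from difflib import SequenceMatcher


def _changed(old_items, new_items):
    """Count items inside non-equal hunks, taking the larger side of each hunk."""
    matcher = SequenceMatcher(None, old_items, new_items, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def changed_lines(old, new):
    """Treat each line as atomic, as a line-based diff does."""
    return _changed(old.splitlines(), new.splitlines())


def changed_words(old, new):
    """Treat each whitespace-separated word as atomic; line breaks are ignored."""
    return _changed(old.split(), new.split())


# Hypothetical paragraph: NEW inserts the word "very" and is reflowed.
OLD = "the quick brown fox\njumps over the lazy\ndog near the river"
NEW = "the very quick brown\nfox jumps over the\nlazy dog near the\nriver"

if __name__ == "__main__":
    print("lines changed:", changed_lines(OLD, NEW))
    print("words changed:", changed_words(OLD, NEW))
```

In this toy case every line counts as changed while only one word differs, which is the kind of inflation described above.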
> Since it's aimed at line-oriented code, it does line-based diffs

You can do word diffs with git:

    git diff --word-diff=color
It's still recreating a word-based diff from a line-based diff.

See diff.c line 793 for how it works.

It may be doing that conversion, but the conversion works. For example, committing the following text (with line breaks), then joining it all into one line, shows no differences when using 'git diff --word-diff'.

  Test the first. This will check if reflowing
  text actually produces git word-diff weirdness,
  or if it's actually decent.
The line does get reproduced on the terminal (a line diff was seen), but no text is shown in green or red to indicate an actual change.
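The experiment above can be approximated outside git with a short Python sketch. It assumes words are split on whitespace and uses `difflib` from the standard library instead of git's own code, so it only illustrates the idea. The helper name `non_equal_hunks` is made up.

```python
from difflib import SequenceMatcher

# The test text from the comment above, with its original line breaks.
WRAPPED = (
    "Test the first. This will check if reflowing\n"
    "text actually produces git word-diff weirdness,\n"
    "or if it's actually decent."
)
# The same text joined into a single line.
JOINED = " ".join(WRAPPED.split())


def non_equal_hunks(old_items, new_items):
    """Return the diff hunks that are not plain 'equal' runs."""
    matcher = SequenceMatcher(None, old_items, new_items, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Line-level comparison: three old lines are replaced by one new line.
line_hunks = non_equal_hunks(WRAPPED.splitlines(), JOINED.splitlines())
# Word-level comparison: the word sequence is identical, so nothing differs.
word_hunks = non_equal_hunks(WRAPPED.split(), JOINED.split())
```

Here `line_hunks` holds a single replace hunk covering the whole paragraph, while `word_hunks` is empty.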
Just checked with the old and new versions of Title_09.txt and you're right, --word-diff does the right thing. It echoes all the changed lines to the terminal, but it only marks up (and colors, in color mode) the changed words:

          [-Section 2 of act-]{+Act+} July 30, 1947, ch. 392, {+Sec. 2,+} 61 Stat. 674, provided that
Wonder if there's a way to enable that behavior on GitHub? And/or to generate repository activity statistics based on changed words rather than changed lines?
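On the statistics question, one rough way to count changed words instead of changed lines is sketched below in Python. It uses `difflib` rather than git, splits on whitespace only, and the function names `word_diff_markup` and `changed_word_counts` are hypothetical. The markup imitates the `[-...-]{+...+}` style shown above.

```python
from difflib import SequenceMatcher


def _opcodes(old_words, new_words):
    return SequenceMatcher(None, old_words, new_words, autojunk=False).get_opcodes()


def word_diff_markup(old, new):
    """Render a word-level diff using [-removed-] and {+added+} markers."""
    a, b = old.split(), new.split()
    pieces = []
    for tag, i1, i2, j1, j2 in _opcodes(a, b):
        if tag == "equal":
            pieces.append(" ".join(a[i1:i2]))
            continue
        piece = ""
        if i1 < i2:  # words present only in the old text
            piece += "[-" + " ".join(a[i1:i2]) + "-]"
        if j1 < j2:  # words present only in the new text
            piece += "{+" + " ".join(b[j1:j2]) + "+}"
        pieces.append(piece)
    return " ".join(pieces)


def changed_word_counts(old, new):
    """Return (words_removed, words_added) as a word-based activity statistic."""
    a, b = old.split(), new.split()
    removed = added = 0
    for tag, i1, i2, j1, j2 in _opcodes(a, b):
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return removed, added


OLD = "Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided that"
NEW = "Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided that"

if __name__ == "__main__":
    print(word_diff_markup(OLD, NEW))
    print(changed_word_counts(OLD, NEW))
```

For this one sentence the sketch reports 4 words removed and 3 added. Its markup happens to match the `--word-diff` output quoted above, but that need not hold in general because git uses its own algorithm and word-splitting rules.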
Sure, but this particular test tells you nothing. You need one where the "line diff" has identified the wrong set of changes (i.e., it has decided two sets of text look close enough that one is really a change into the other, even though that's not what historically happened).
That is never the error I see in line-based diffs. Because they take a much larger chunk of text to be atomic and treat irrelevant characters as relevant, they tend to give false positives: seeing two lines as completely different when in fact they're slightly modified versions of each other, or seeing two blocks of text as different when one is a reflowed version of the other.
For sporadic line differences, git diff supports algorithms other than the default; patience or histogram may work better.

As far as words within lines go, you do have a point.

I did some work and research on diffs when I tried to visualise the progression of Slovak law. My best attempt was a diff method that would understand the inner structure of the law. I ended up with simple draft but I am sure somebody more competent could look into that.
At least in the US, a lot of the laws that get passed are in the form of diffs.

That is, the law that they enact says "This law is to do blah blah blah.

Subsection 1373(a) of the US Code is replaced with the following text 'blah blah blah'"

The wording used is pretty standard. So you can actually parse it in most cases to see what the actual changes are.
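As a rough illustration of that kind of parsing, here is a Python sketch with a regular expression. The pattern only matches the paraphrased wording in this comment, not real statutory language, which is more varied. The names `AMENDMENT`, `Replacement` and `parse_amendment` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: amendments follow the paraphrased form used in this thread.
# Real amendatory language would need a much richer grammar.
AMENDMENT = re.compile(
    r"Subsection\s+(?P<target>\d+\([a-z]\))\s+of\s+the\s+US\s+code\s+"
    r"is\s+replaced\s+with\s+the\s+following\s+text\s+'(?P<text>[^']*)'",
    re.IGNORECASE,
)


@dataclass
class Replacement:
    target: str    # the subsection being replaced, e.g. "1373(a)"
    new_text: str  # the replacement text quoted in the amendment


def parse_amendment(sentence: str) -> Optional[Replacement]:
    """Extract the target subsection and new text, or None if no match."""
    match = AMENDMENT.search(sentence)
    if match is None:
        return None
    return Replacement(target=match.group("target"), new_text=match.group("text"))


if __name__ == "__main__":
    example = (
        "Subsection 1373(a) of the US code is replaced with "
        "the following text 'blah blah blah'"
    )
    print(parse_amendment(example))
```

A parsed `Replacement` could then be applied to the stored text of the target subsection, which would give the actual change without running a diff at all.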

> I ended up with simple draft but I am sure somebody more competent could look into that.

If nothing prevents you, you ought to throw this up for others to see. Worst comes to worst, no one finds it useful.