by sc68cal 4702 days ago
This is currently done via scraping:

https://github.com/divegeek/uscode

The diffs are huge.

4 comments

Remember that diff is an algorithm to generate the smallest set of operations to produce version B from version A, not an accurate reconstruction of what happened. Diff algorithms are also often tuned not to try as hard to find the smallest set of changes for larger documents, due to speed concerns.
Git's built-in diff algorithm is particularly bad for text. Since it's aimed at line-oriented code, it does line-based diffs, which is horrible for ASCII text that is reflowed, because every line in a paragraph will show up as changed for a small change.

Example: https://github.com/divegeek/uscode/commit/1fb2d83137dad1c6ca...

What's happened is that "Section 2" was moved later in the sentence, abbreviated as "Sec. 2", "of" was deleted, and "act" was capitalized:

    Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided...

    Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided...
The rest of the paragraph is unchanged, but git shows a 6-line diff with the entire paragraph replaced. GitHub attempts to do some word-based highlighting (see the timestamp lines), but it falls down on most of these paragraphs. Wikipedia's diffing tends to work better for this kind of thing; I'm not sure what they use. The upshot is that the number of lines changed may be a 5-10x overestimate.
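
To make the overestimate concrete, here is a minimal Python sketch that counts changed lines versus changed words for the citation above. It uses only the standard library's difflib, not git's diff code. The filler sentence tail, the wrap width of 40, and the helper name changed_units are my own assumptions, not anything from the repository.

```python
import difflib
import textwrap

# The citation is from the commit; the tail is made-up filler standing in
# for the unchanged remainder of the paragraph.
TAIL = ("that this is filler text standing in for the rest of the paragraph, "
        "which is identical in both versions of the title.")
OLD = "Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided " + TAIL
NEW = "Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided " + TAIL


def changed_units(a, b):
    """Return (units removed from a, units added in b) for two sequences."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    removed = added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return removed, added


# Line-based view: reflow both versions to the same width, then diff lines.
line_stats = changed_units(textwrap.wrap(OLD, width=40),
                           textwrap.wrap(NEW, width=40))

# Word-based view: split on whitespace so line breaks do not matter.
word_stats = changed_units(OLD.split(), NEW.split())

print("lines removed/added:", line_stats)
print("words removed/added:", word_stats)
```

The word count comes out at 4 removed and 3 added regardless of wrapping. The line count depends on where the wrap points fall, which is exactly why it is a poor measure for reflowed prose.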
> Since it's aimed at line-oriented code, it does line-based diffs

You can do word diffs with git:

    git diff --word-diff=color
It's still recreating a word-based diff from a line-based diff.

See diff.c line 793 for how it works.

It may be doing that conversion, but the conversion works. For example, committing the following text (with line breaks), then joining it all into one line, shows no differences when using 'git diff --word-diff'.

  Test the first. This will check if reflowing
  text actually produces git word-diff weirdness,
  or if it's actually decent.
The line does get reproduced on the terminal (a line diff was seen), but no text is shown in green or red to indicate an actual change.
Just checked with the old and new versions of Title_09.txt and you're right, --word-diff does the right thing. It echoes all the changed lines to the terminal, but it only marks up (and colors, in color mode) the changed words:

          [-Section 2 of act-]{+Act+} July 30, 1947, ch. 392, {+Sec. 2,+} 61 Stat. 674, provided that
Wonder if there's a way to enable that behavior on GitHub? And/or to generate repository activity statistics based on changed words rather than changed lines?
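
For generating your own statistics outside GitHub, a whitespace-insensitive word diff is easy to approximate. The Python sketch below is only an imitation: it uses difflib rather than git's implementation, the function name word_diff is made up, and it assumes words are simply runs of non-whitespace characters.

```python
import difflib


def word_diff(old_text, new_text):
    """Return a diff in git-like [-removed-]{+added+} markup.

    Both texts are split on any whitespace, so reflowing a paragraph
    (moving line breaks around) produces no changes.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
            continue
        piece = ""
        if i2 > i1:  # words present only in the old text
            piece += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if j2 > j1:  # words present only in the new text
            piece += "{+" + " ".join(new_words[j1:j2]) + "+}"
        pieces.append(piece)
    return " ".join(pieces)


# Same citation as above, with the line break deliberately moved.
old = "Section 2 of act July 30, 1947, ch. 392,\n61 Stat. 674, provided that"
new = "Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674,\nprovided that"
print(word_diff(old, new))
```

Counting the words inside the markers, instead of the lines touched, would give a per-commit activity figure that is not inflated by reflowing.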
Sure, but this particular test tells you nothing. You need one where the "line diff" has identified the wrong set of changes (i.e., it has decided two sets of text look close enough that one is really a change into the other, even though that's not what historically happened).
If the line differences are sporadic, git diff supports algorithms other than the default; patience or histogram may work better.

As far as words within lines go, you do have a point.

I did some work and research on diffs when I tried to visualise the progression of Slovak law. My best attempt was a diff method that would understand the inner structure of the law. I ended up with a simple draft, but I am sure somebody more competent could look into that.
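
To give a rough idea of what "understanding the inner structure" could mean, here is a toy Python sketch. It is not the draft mentioned above. It assumes, purely for illustration, that every section starts on a line of the form "Sec. N." and it compares sections by number instead of comparing lines. The names split_sections and structural_diff and the sample texts are invented.

```python
import re

# Assumed heading format (hypothetical): a line starting with "Sec. <number>."
HEADING_RE = re.compile(r"^Sec\. (\d+[a-z]?)\.", re.MULTILINE)


def split_sections(text):
    """Map section number -> section body with whitespace normalised."""
    sections = {}
    matches = list(HEADING_RE.finditer(text))
    for idx, match in enumerate(matches):
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        body = text[match.end():end]
        sections[match.group(1)] = " ".join(body.split())
    return sections


def structural_diff(old_text, new_text):
    """Report which section numbers were added, removed, or changed."""
    old, new = split_sections(old_text), split_sections(new_text)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


old_law = "Sec. 1. Dogs must be\nleashed.\nSec. 2. Cats are exempt.\n"
new_law = ("Sec. 1. Dogs must be leashed.\n"
           "Sec. 2. Cats are not exempt.\n"
           "Sec. 3. Fines apply.\n")
print(structural_diff(old_law, new_law))
```

Section 1 is only reflowed, so it is not reported; section 2 is reported as changed and section 3 as added. A real law would need nested units (paragraphs, letters, points) and renumbering handled as well.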
At least in the US, a lot of the laws that get passed are in the form of diffs.

That is, the law that they enact says "This law is to do blah blah blah.

Subsection 1373(a) of the US code is replaced with the following text 'blah blah blah'"

The wording used is pretty standard. So you can actually parse it in most cases to see what the actual changes are.
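
As a rough illustration of how parseable that kind of wording is, here is a small Python sketch. The regular expression is modelled only on the made-up sentence above, so the pattern, the field names, and the function parse_amendment are all assumptions; real amendatory language has many more forms ("is amended by striking", "is repealed", and so on).

```python
import re

# Hypothetical pattern covering only the "is replaced with the following
# text" form used in the example sentence; not a real bill grammar.
AMEND_RE = re.compile(
    r"(?P<unit>Section|Subsection)\s+(?P<cite>\d+(?:\([a-z0-9]+\))*)"
    r"\s+of\s+(?P<target>.+?)"
    r"\s+is\s+replaced\s+with\s+the\s+following\s+text\s+"
    r"'(?P<text>[^']*)'",
    re.IGNORECASE,
)


def parse_amendment(sentence):
    """Return the amendment as a dict, or None if the wording is not recognised."""
    match = AMEND_RE.search(sentence)
    if match is None:
        return None
    return {
        "unit": match.group("unit"),
        "cite": match.group("cite"),
        "target": match.group("target"),
        "new_text": match.group("text"),
    }


example = ("Subsection 1373(a) of the US code is replaced with "
           "the following text 'blah blah blah'")
print(parse_amendment(example))
```

Each recognised sentence becomes an explicit "replace this unit with this text" operation, which is already a diff and needs no text comparison at all.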

> I ended up with simple draft but I am sure somebody more competent could look into that.

If nothing prevents you, you ought to throw this up for others to see. Worst comes to worst, no one finds it useful.

In Germany we solved this problem by converting the XML into readable markdown first:

https://github.com/bundestag/gesetze/commit/f90e8fc8eb20f081...

(see also: https://news.ycombinator.com/item?id=6137337)

Back in college, I had a project where I walked through every line of the Patriot Act and noted exactly which paragraphs were modified.

I'd be surprised if we couldn't write a parser for bills to produce more efficient diffs.