by pnathan 4702 days ago
I'm really tempted to collect the XML files and put them on github, with periodic checkpoints to update it with the latest.

Watching the evolution of law over time is a fascinating thing and using SW engineering tools to help would be really fun.

11 comments

This is currently done via scraping:

https://github.com/divegeek/uscode

The diffs are huge.

Remember that diff is an algorithm to generate the smallest set of operations to produce version B from version A, not an accurate reconstruction of what happened. Diff algorithms are also often tuned not to try as hard to find the smallest set of changes for larger documents, due to speed concerns.
Git's built-in diff algorithm is particularly bad for text. Since it's aimed at line-oriented code, it does line-based diffs, which is horrible for ASCII text that is reflowed, because every line in a paragraph will show up as changed for a small change.

Example: https://github.com/divegeek/uscode/commit/1fb2d83137dad1c6ca...

What's happened is that "Section 2" was moved later in the sentence, abbreviated as "Sec. 2", "of" was deleted, and "act" was capitalized:

    Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided...

    Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided...
The rest of the paragraph is unchanged, but git shows a 6-line diff with the entire paragraph replaced. GitHub attempts to do some word-based highlighting (see the timestamp lines), but it falls down on most of these paragraphs. Wikipedia's diffing tends to work better for this kind of thing; I'm not sure what they use. The upshot is that the number of lines changed may be a 5-10x overestimate.
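To make the word-level view concrete, here is a minimal sketch in Python using the standard library's difflib on the two sentences quoted above. It is not what git or GitHub do internally; the helper name `word_diff` is made up for illustration.

```python
import difflib

old = "Section 2 of act July 30, 1947, ch. 392, 61 Stat. 674, provided..."
new = "Act July 30, 1947, ch. 392, Sec. 2, 61 Stat. 674, provided..."


def word_diff(a, b):
    """Return (tag, old_words, new_words) for every run of changed words."""
    a_words, b_words = a.split(), b.split()
    # autojunk=False keeps difflib from discarding frequent words on long inputs.
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = word_diff(old, new)
for tag, removed, added in changes:
    print(tag, removed, "->", added)
```

Compared word by word, only the leading "Section 2 of act" and the inserted "Sec. 2," show up as changes; the date, chapter and Stat. citation are reported as equal.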
> Since it's aimed at line-oriented code, it does line-based diffs

You can do word diffs with git:

    git diff --word-diff=color
It's still reconstructing a word-based diff from a line-based diff.

See diff.c line 793 for how it works.

It may be doing that conversion, but the conversion works. For example, committing the following text (with line breaks), then joining it all into one line, shows no differences when using 'git diff --word-diff'.

  Test the first. This will check if reflowing
  text actually produces git word-diff weirdness,
  or if it's actually decent.
The line does get reproduced on the terminal (a line diff was seen), but no text is shown in green or red to indicate an actual change.
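The same effect can be reproduced outside git. This is only a sketch of the principle in Python (compare whitespace-separated tokens instead of lines), not a description of how git implements --word-diff:

```python
wrapped = (
    "Test the first. This will check if reflowing\n"
    "text actually produces git word-diff weirdness,\n"
    "or if it's actually decent.\n"
)
# Join the paragraph into one line, as in the experiment described above.
joined = " ".join(wrapped.split())

# A line-based comparison sees three lines replaced by one.
line_level_same = wrapped.splitlines() == joined.splitlines()
# A word-based comparison sees the identical token sequence.
word_level_same = wrapped.split() == joined.split()

print("lines identical:", line_level_same)
print("words identical:", word_level_same)
```

The lines differ while the word sequences are identical, which matches the observation that the line is shown but nothing is colored.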
If the problem is sporadic line differences, git diff supports algorithms other than the default; patience or histogram may work better.

As far as words within lines go, you do have a point.

I did some work & research about diffs when I tried to visualise the progression of Slovak law. My best attempt was a diff method that would understand the inner structure of the law. I ended up with a simple draft but I am sure somebody more competent could look into that.
At least in the US, a lot of the laws that get passed are in the form of diffs.

That is, the law that they enact says "This law is to do blah blah blah.

Subsection 1373(a) of the US code is replaced with the following text 'blah blah blah'"

The wording used is pretty standard. So you can actually parse it in most cases to see what the actual changes are.
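As a rough illustration of parsing such wording, here is a minimal Python sketch. The regular expression is an assumption modeled only on the example sentence in this comment; real bills use more varied phrasing, and the names `Amendment` and `parse_amendment` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Amendment(NamedTuple):
    section: str
    new_text: str


# Hypothetical pattern built from the example wording above, not from real bills.
PATTERN = re.compile(
    r"Subsection\s+(?P<section>[\w().]+)\s+of the US code\s+"
    r"is replaced with the following text\s+'(?P<text>[^']*)'",
    re.IGNORECASE,
)


def parse_amendment(sentence: str) -> Optional[Amendment]:
    """Extract the target subsection and replacement text, or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return Amendment(match.group("section"), match.group("text"))


parsed = parse_amendment(
    "Subsection 1373(a) of the US code is replaced with the following text "
    "'blah blah blah'"
)
print(parsed)
```

Returning None for sentences that do not match leaves room for a human to review whatever the pattern cannot handle.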

> I ended up with simple draft but I am sure somebody more competent could look into that.

If nothing prevents you, you ought to throw this up for others to see. Worst comes to worst, no one finds it useful.

In Germany we solved this problem by converting the XML into readable markdown first:

https://github.com/bundestag/gesetze/commit/f90e8fc8eb20f081...

(see also: https://news.ycombinator.com/item?id=6137337)

Back in college, I had a project where I walked through every line of the Patriot Act and noted exactly which paragraphs were modified.

I'd be surprised if we couldn't write a parser for bills to produce more efficient diffs.

It would be even awesome-er (and more useful) if you could parse individual bills and amendments into diffs, which get merged into 'master' as they become law.

I'd love to `git blame` the U.S. code.

But that only works at the shallow level. A crook can get around that by asking/bribing/convincing someone else to be the one who's responsible for the amendment.
There is very little outright corruption in Congress. Special interests exert most of the influence through campaign contributions that are publicly disclosed. Larry Lessig has a great book on this: http://www.amazon.com/Republic-Lost-Corrupts-Congress---eboo.... And here is the link to his TED talk on the same topic: http://www.youtube.com/watch?v=mw2z9lV3W1g
I wonder if Congress uses any sort of version control. The text of these bills is written by staffers and sometimes lobbyists, so I'm not sure how it would work.
I was under the impression law is essentially "append only". New laws override existing laws, but the text of the existing law never changes.
Laws are essentially diffs against the US code. The diff (slip law) is canonical. They are continually compiled into the US code, which can involve deleting or changing text just like a diff, and periodically an edited, annotated code is published. After a certain amount of time, Congress enacts a portion of the published code, making it canonical and overriding any prior slip law.

See: http://en.wikipedia.org/wiki/United_States_Code#Legal_status

What we need now is software that reads bills ("in section 123.abc the text 'blah blah' is replaced by 'bleh bleh'") and compiles them into before/after views of what the resulting code would be.
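A toy version of that idea might look like the following Python sketch. Everything here is an assumption for illustration: the instruction grammar is just the example sentence above, the code is modeled as a dict from section number to text, and `apply_instruction` is a made-up helper.

```python
import difflib
import re

# Hypothetical instruction grammar, taken from the example sentence above.
INSTRUCTION = re.compile(
    r"in section (?P<section>\S+) the text '(?P<old>[^']*)' "
    r"is replaced by '(?P<new>[^']*)'"
)


def apply_instruction(code: dict, instruction: str) -> dict:
    """Return a new copy of `code` with the instruction applied."""
    match = INSTRUCTION.search(instruction)
    if match is None:
        raise ValueError("unrecognized instruction")
    section, old, new = match.group("section", "old", "new")
    if section not in code or old not in code[section]:
        raise ValueError(f"cannot find {old!r} in section {section}")
    updated = dict(code)  # leave the "before" version untouched
    updated[section] = code[section].replace(old, new)
    return updated


before = {"123.abc": "The fee is blah blah per year."}
after = apply_instruction(
    before, "in section 123.abc the text 'blah blah' is replaced by 'bleh bleh'"
)

# Print a before/after view of the affected section.
for line in difflib.unified_diff(
    before["123.abc"].splitlines(),
    after["123.abc"].splitlines(),
    fromfile="before",
    tofile="after",
    lineterm="",
):
    print(line)
```

Keeping the original dict unmodified means each bill yields a separate snapshot, which is what a commit-per-law history would need.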
I just suggested this, and then scrolled down to find this.

Someone needs to take the plunge and start writing the program; throw it on Github and tell us all about it. I know people who are looking for such a tool.

The US Code is not the same as the laws of the US. It is a "current snapshot" of existing laws in force, and does not itself have legal weight unless explicitly granted by Congress.
See, e.g., U.S. National Bank of Oregon v. Independent Insurance Agents of America, Inc., 508 U.S. 439, 440 (1993) for the Supreme Court's ruling and underlying logic.
According to Wikipedia:

"When sections are repealed, their text is deleted and replaced by a note summarizing what used to be there." https://en.wikipedia.org/wiki/United_States_Code#Treatment_o...

Imagine trying to keep track of this in paper form instead of digitally.
Imagine? I used to do it. When I first started out, I would get stacks of the Chicago Municipal Code revisions on onionskin and it was my job to follow the instructions to update the five-inch binder.

"Remove pages 123.4 - 123.6 and replace with pages 123.4a-123.7."

Later, when I learned about diffs, I understood the concept immediately.

They've been doing exactly that for about 237 years.
We actually have something similar in Germany, called the "Bundesgit" (https://github.com/bundestag/gesetze)
I'm not a native German speaker but that's a pretty clever pun, right?
Not that much of a pun, really. Just a funny-sounding portmanteau. That being said, official names of things by the government usually sound very ridiculous, so this is definitely much saner.
'Bundes-' means federal. It's not much of a pun, but I like it :)
"Diffing the law" between different date points would be amazing. I hope you follow your temptation in this regard.

I'd do it myself but I'm already neck deep in work and volunteer projects outside of work.

The funny thing is that a lot of law is structured like definitions, the actual law, then consequences. Any of those three can change independently, and change the meaning of the law. So diffs are often not as useful as you wish.
We need to parse the law into some simulation code, and then have unit tests (does scenario B cause citizen A's rights to be violated), and then check if changes break the tests.
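A deliberately tiny Python sketch of what that could look like. The statute, its numbers and the scenarios are all invented for illustration; the point is only that a change to a definition can break a scenario test even though the rule itself is untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statute:
    adult_age: int    # the definition: who counts as an adult
    curfew_hour: int  # the rule: minors must be home by this hour (24h clock)


def violates_curfew(statute, age, hour_out):
    """True if a person of `age` who is out at `hour_out` breaks the rule."""
    is_minor = age < statute.adult_age
    return is_minor and hour_out >= statute.curfew_hour


def check_scenarios(statute):
    """Run scenario tests; each entry is (age, hour_out, expected_violation)."""
    scenarios = [(17, 23, True), (18, 23, False), (17, 20, False)]
    return [
        violates_curfew(statute, age, hour) == expected
        for age, hour, expected in scenarios
    ]


current = Statute(adult_age=18, curfew_hour=22)
amended = Statute(adult_age=21, curfew_hour=22)  # only the definition changes

print("current:", check_scenarios(current))
print("amended:", check_scenarios(amended))
```

Under the amended definition the second scenario (an 18-year-old out at 23:00) no longer passes, which a text diff of the rule alone would not reveal.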
Do it. Unit test the code. You might even be able to get law students to do it for free as a study aid.
I just created a repo with all the codes[1]. There was a convenient link to all codes in one zip file. I will note that this is a massive amount of text.

There is a link to the schema used and a stylesheet (I assume for the XHTML, maybe?) that I would like to add in. But one step at a time.

[1] https://github.com/varikin/UnitedStatesCode/

Please do! This is the kind of thing that hatches in my mind as a great idea, but withers due to a total lack of follow-up.
It would also be fascinating to see visualizations of this evolution. And perhaps by matching it with the voting data from Congress, one could track the footprint of each congressman.
See my comment above, you might be interested :)
Probably more important than just diffs is actually a dependency graph and a topical index. While they've tried to do this via titles/chapters and references, linear breaks are never going to be as successful as many-to-many linkages as every Bible concordance out there demonstrates.
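A dependency graph of this kind could be bootstrapped from the cross-references already in the text. The Python sketch below assumes references take the form "section N of this title" and uses three invented sample sections; a real extractor would need to handle many more citation styles.

```python
import re
from collections import defaultdict

# Assumed citation form; other styles (other titles, subsections) are ignored.
REF = re.compile(r"section (\d+[a-z]?) of this title")


def build_reference_graph(sections):
    """Map each section number to the sorted list of sections it cites."""
    return {
        number: sorted(set(REF.findall(text)))
        for number, text in sections.items()
    }


def cited_by(graph):
    """Invert the graph: map each section to the sections that cite it."""
    reverse = defaultdict(list)
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)
    return dict(reverse)


# Invented sample text, for illustration only.
sections = {
    "101": "Definitions for section 102 of this title and section 103 of this title.",
    "102": "Subject to section 101 of this title, a permit is required.",
    "103": "Penalties for violating section 102 of this title.",
}

graph = build_reference_graph(sections)
print(graph)
print(cited_by(graph))
```

The inverted map gives the many-to-many view: for any section, every other section that depends on it.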
I'd really like to see a graph of the number of bytes over time, and what sort of curve it fits.
Do they accept pull requests?