by chipx86 372 days ago
Git is using a proprietary variant on top of Unified Diffs. Unified Diffs themselves convey very little information about the file being modified, focusing solely on the line-based contents of text files and allowing vendors to provide their own "garbage" lines containing anything else. Every SCM that tracks information beyond line changes in a diff fills out the garbage data differently.
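To make the "garbage lines" point concrete, here is a minimal sketch in Python of a parser that keeps only what Unified Diff itself defines (file headers and hunks) and sets everything else aside. The function name `split_unified_diff` and the sample `index` hashes are made up for illustration; the `diff --git` and `index` lines are modeled on what Git emits, and this is not how any particular tool actually parses diffs.

```python
import difflib
import re

# Hunk header, e.g. "@@ -1,3 +1,3 @@"; the counts default to 1 when omitted.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def split_unified_diff(text):
    """Separate standard unified-diff lines from vendor-specific extra lines."""
    standard, extra = [], []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk, every line is content regardless of its prefix.
            standard.append(line)
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline ..." markers
                old_left -= 1
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            old_left = int(match.group(2) or 1)
            new_left = int(match.group(4) or 1)
            standard.append(line)
        elif line.startswith(("--- ", "+++ ", "\\")):
            standard.append(line)
        else:
            # Anything else is outside the format: SCM-specific metadata.
            extra.append(line)
    return standard, extra


old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n"]
body = "".join(difflib.unified_diff(old, new, "a/demo.txt", "b/demo.txt"))
git_style = (
    "diff --git a/demo.txt b/demo.txt\n"
    "index 1111111..2222222 100644\n"  # made-up hashes, illustration only
    + body
)
standard, extra = split_unified_diff(git_style)
```

Everything that ends up in `extra` is exactly the part a generic parser cannot interpret without knowing which SCM produced the diff.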

The intent here isn't to let you copy changes from one type of repository to another, but to have a format that can be generated from many SCMs that a tool could parse in a consistent way.

Right now, tools working with diffs from multiple types of SCMs need at least one diff parser per SCM (some provide multiple formats, or have significantly changed compatibility between releases).

For SCMs that lack a diff format (there are several) or lack one that contains enough information to identify a file or its changes (there are several), tools also need to choose a method to represent that information. That often means yet another custom diff format that is specific to the tool consuming the diff.

We've spent over 20 years dealing with the headaches and pain points here, giving it a lot of thought. DiffX (which is now a few years old itself) has worked out very well as a solution for us. This wasn't done in a vacuum, but rather has gone through many rounds of discussion with developers at a few different SCM vendors who have given thought to these issues and supplied much valuable feedback and improvements for the spec.


> Git is using a proprietary variant on top of Unified Diffs.

What definition of "proprietary" are you using?

Created by the Git team for Git's purposes, rather than something documented or proposed for wider adoption.

Other SCMs can and do use a Git-style diff format, but as there's no defined grammar, there are sometimes important differences. For example, Mercurial's Git-style diffs represent revisions in a different format than Git's do, with different meanings; they reuse Git "index" lines for binary files but include SHAs of the file contents instead of any sort of revision; and they have a header block that should be stripped out before sending to a Git-style diff parser.

Aren't you doing the same thing? After all, this is just Review Board's custom diff format that nobody else uses.
Yep! We spent 20 years dealing with these problems and in those 20 years nobody really solved these pain points. So we talked to some SCM vendors, bounced ideas around, built a spec, got feedback from them, repeated off-and-on for a couple years until we got the current draft, and implemented it for our needs.

It's been a few years now, and so far so good for the purposes we built it for. And it's there for any other tool or SCM authors to use if it also happens to be useful to them.

Feels more like in 20 years nobody else really has those pain points.

1. For most people using multiple SCMs is just a huge and easily-avoidable mistake. Most people can just mandate a single SCM for a project and then all these problems are moot.

2. For the things listed in TFA

    A single diff can’t represent a list of commits
That's what "patch" and "patch format" are for. They work great.

    There’s no standard way to represent binary patches
Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?

    Diffs don’t know about text encodings (which is more of a problem than you might think)
This goes away if people on a project agree on a particular encoding (which is going to be UTF-8, let's face it). If someone sends a diff in an incorrect file encoding via DiffX, it will still apply wrong if someone uses a non-DiffX-aware (aka standard) tool to apply it. So DiffX doesn't really fix this problem.

    Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.
This goes away if you just use one SCM for a project which you should anyway for everyone's sanity.
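For readers who haven't hit the encoding issue from the list above, a small sketch of the failure mode in Python: the same line of text has different bytes in different encodings, so a byte-for-byte match of a patch line against the file fails unless something records which encoding each side used. The function name `line_matches` and its parameters are hypothetical, and the sketch takes no position on whether that record belongs in the diff or in a project convention.

```python
def line_matches(file_bytes, patch_line, patch_encoding=None, file_encoding=None):
    """Check whether a patch line occurs in a file, optionally re-encoding it.

    With no encodings given, this is a plain byte comparison, which is all a
    tool can do when the diff carries no encoding information.
    """
    if patch_encoding and file_encoding:
        # Re-encode the patch line into the file's encoding before comparing.
        patch_line = patch_line.decode(patch_encoding).encode(file_encoding)
    return patch_line in file_bytes


text = "café\n"
file_bytes = text.encode("latin-1")  # how the file is stored in the repository
patch_line = text.encode("utf-8")    # how the diff author's tool wrote the line

blind = line_matches(file_bytes, patch_line)
informed = line_matches(file_bytes, patch_line, "utf-8", "latin-1")
```

Here `blind` is `False` and `informed` is `True`: the text is identical, but only the comparison that knows both encodings finds it.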
> 1. For most people using multiple SCMs is just a huge and easily-avoidable mistake. Most people can just mandate a single SCM for a project and then all these problems are moot.

You talk about SCMs; we're talking about VCSs, where it's not just source code under control, or even source code with a handful of binary assets. Imagine dealing with a VCS that has to handle 15 years and a few petabytes of binary assets, or individual files that are multiple gigabytes and have changes made to them several times per day. Can Git do that gracefully just by itself? Or SVN? Even Perforce struggled with something like that back in the day.

> Very unclear why anyone needs this. There's no standard way to code-review a binary diff (it depends what the blob is that you're diffing) so how would it help if you had this standard way to represent the diff?

A standard way of handling the binary data doesn't mean understanding the binary data. You can leave that up to specific tools. What you need though is a way to somehow package up and describe those binary diffs enough that you can transport the diff data and pick the right tool to show you the actual differences.
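As a rough sketch of "describe the binary diff enough to pick the right tool", in Python: the container only needs to carry a path, some type information, and the two payloads, and the consumer dispatches on the type without understanding the bytes. The `BinaryChange` class, its `mime_type` field, and the viewer functions are all hypothetical names for illustration, not fields from any actual diff format.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BinaryChange:
    """One changed binary file plus just enough metadata to route it."""
    path: str
    mime_type: str  # hypothetical metadata field carried alongside the diff
    old: bytes
    new: bytes


def describe_image(change: BinaryChange) -> str:
    # A real tool would hand both payloads to an image comparison view.
    return f"{change.path}: image changed, open in an image comparison tool"


def describe_generic(change: BinaryChange) -> str:
    # Fallback when nothing knows the format: report sizes only.
    return f"{change.path}: {len(change.old)} -> {len(change.new)} bytes"


# Registry of type-specific viewers; the transport layer never inspects bytes.
VIEWERS: Dict[str, Callable[[BinaryChange], str]] = {
    "image/png": describe_image,
}


def pick_viewer(change: BinaryChange) -> Callable[[BinaryChange], str]:
    """Choose a viewer from the metadata, falling back to the generic one."""
    return VIEWERS.get(change.mime_type, describe_generic)


logo = BinaryChange("art/logo.png", "image/png", b"\x89PNG old", b"\x89PNG new!")
blob = BinaryChange("data/level.bin", "application/octet-stream", b"abc", b"abcdef")
summaries = [pick_viewer(change)(change) for change in (logo, blob)]
```

The point of the split is that the format carrying the diff and the tool rendering it can evolve independently, as long as the descriptive metadata is parsed the same way everywhere.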

> This goes away if you just use one SCM for a project which you should anyway for everyone's sanity.

And if wishes were fishes, I'd never be hungry again. If you have a lot of history, a lot of data, a lot of workflows and tools built up around multiple VCSs, then changing that to just one VCS is going to be a massive undertaking. And not every VCS can handle all of the kinds of data that might get input into it. Some are going to be good at text data, some might handle binary assets better. Some might have a commit model that makes sense for one type of workflow but not for another. For example, you might be dealing with binary assets where you can only have one person working on a specific file at a time because there's no real way to merge changes from multiple people, so they need to lock it. For text assets though, you might be able to handle having multiple people work on a file. To afford both workflows, your VCS now needs to not only support both locking modes, but be hyper-aware of the specific content to know which kind of locking to permit for specific files.

The world doesn't always fit into the nice little models that the most popular VCSs provide. So if you're trying not to limit your product to supporting just that handful of popular VCSs, you can't just assume everything will fit into one of those models.