by thaumasiotes 2044 days ago
> Git is an object database and its objects are blobs, trees and commits not diffs

What do you think is the difference between a "commit" and a "diff"?


A "commit" doesn't contain a diff, it contains (references to) the blobs of the files at that state. Diffs are display-only, generated by comparing two full file states.
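To make that concrete, here is a toy sketch in Python. It is not Git's real on-disk format, and `put`, `commit`, `read_file` and `show_diff` are made-up names for illustration. The point is only that a commit names a tree, the tree names full blobs, and a diff is computed on demand from two full file states.

```python
import difflib
import hashlib

store = {}  # toy object database: hex digest -> raw bytes


def put(data: bytes) -> str:
    """Store data under the hash of its contents and return the key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def commit(files: dict, parent=None) -> str:
    """Write one blob per file, a tree listing them, and a commit naming the tree."""
    tree_lines = [f"{put(text.encode())} {name}" for name, text in sorted(files.items())]
    tree_key = put("\n".join(tree_lines).encode())
    return put(f"tree {tree_key}\nparent {parent}\n".encode())


def read_file(commit_key: str, name: str) -> str:
    """Follow commit -> tree -> blob and return the full file contents."""
    tree_key = store[commit_key].decode().splitlines()[0].split()[1]
    for line in store[tree_key].decode().splitlines():
        blob_key, fname = line.split(" ", 1)
        if fname == name:
            return store[blob_key].decode()
    raise KeyError(name)


def show_diff(old_commit: str, new_commit: str, name: str) -> list:
    """Diffs are display-only: computed by comparing two full file states."""
    old = read_file(old_commit, name).splitlines()
    new = read_file(new_commit, name).splitlines()
    return list(difflib.unified_diff(old, new, lineterm=""))


c1 = commit({"a.txt": "one\ntwo\n"})
c2 = commit({"a.txt": "one\n2\n"}, parent=c1)
diff = show_diff(c1, c2, "a.txt")
```

Nothing in `store` is a diff; `diff` exists only because `show_diff` was asked for it.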
You really believe that git stores -- in full -- every version of a tracked file? Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space gone?
> You really believe that git stores -- in full -- every version of a tracked file?

Yes, it does.

> Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space

Yes, it is.

"It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap." -- https://lwn.net/Articles/131657/

One of the insights of the git design was that, nowadays, disk space is cheap. The first releases of git always stored each object separately in its own file in the object database. Git still does so nowadays, but once the number of files gets over a certain threshold, newer releases of git run an "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size. But that's just a physical storage optimization; in the logical model, whenever you ask for an object, you always get its full contents, not a delta against some other object.
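The content addressing described in the LWN quote can be sketched in a few lines of Python. This assumes a repository using the default SHA-1 object format; `git_blob_id` is a hypothetical helper name, not part of any Git API. A blob's id is the SHA-1 of a short header (`blob`, the size, a NUL byte) followed by the file contents, so any change to the contents, even stripping whitespace from one line, yields a different object.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id Git would assign to a blob with these contents (SHA-1 repos)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


original = b"first line\n    \nlast line\n"
stripped = b"first line\n\nlast line\n"  # whitespace removed from the blank line

id_original = git_blob_id(original)
id_stripped = git_blob_id(stripped)
id_empty = git_blob_id(b"")  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The two ids share nothing visible in common, which is what "no obvious connection between two versions of a file" means in practice.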

> "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size

Isn't it the case then that git doesn't store in full every version of a tracked file?

Deduplication and compression do not imply diffs.
git does perform delta compression. From what I can gather, git's storage engine uses conventional compression (zlib), deduplication (exactly identical files need not be stored twice) and delta compression (between similar files).
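A rough Python sketch of the first two mechanisms, under the assumption that objects are still "loose" (not yet packed). `LooseStore` is a made-up class for illustration, not Git's implementation, and delta compression is deliberately left out of it. Identical contents are stored once, each object is zlib-compressed on its own, and a one-character edit to a roughly 30KB file still adds a second full compressed object.

```python
import hashlib
import zlib


class LooseStore:
    """Toy content-addressed store: one zlib-compressed entry per distinct content."""

    def __init__(self):
        self.objects = {}  # object id -> compressed bytes

    def write(self, content: bytes) -> str:
        oid = hashlib.sha1(content).hexdigest()
        # Deduplication: identical content maps to the same id and is stored once.
        self.objects.setdefault(oid, zlib.compress(content))
        return oid

    def read(self, oid: str) -> bytes:
        return zlib.decompress(self.objects[oid])

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.objects.values())


# Roughly 30KB of deterministic text that does not compress to almost nothing.
text = "\n".join(
    f"line {i}: {hashlib.sha1(str(i).encode()).hexdigest()}" for i in range(600)
).encode()
edited = text.replace(b"line 300:", b"line 300: ", 1)  # one extra space

store = LooseStore()
first_id = store.write(text)
second_id = store.write(text)      # exact duplicate: no new object
edited_id = store.write(edited)    # near duplicate: a whole new object
size_of_one = len(zlib.compress(text))
```

With only these two mechanisms, `store.stored_bytes()` comes out at about twice `size_of_one`; closing that gap is the job of the delta step applied when objects are packed.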

The question was whether git's implementation stores, in full, every version of a tracked file. The answer is that it doesn't. git has a sophisticated storage engine precisely to avoid the inefficiencies of the naive approach.

The git book explains the internals very well, so you can easily verify it for yourself. Files are referenced as objects in trees, which are pointed to by commits. Editing a file creates a new object for it. (edited for tone)
From the book:

> You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

> It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.

> When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space.

https://git-scm.com/book/en/v2/Git-Internals-Packfiles

That's just for compression. Commits aren't diffs, and when you checkout stuff, git doesn't do diffs to give you the working directory at that point. See https://stackoverflow.com/a/25028688/8272371 for detailed explanation.
> See https://stackoverflow.com/a/25028688/8272371 for detailed explanation.

There is a disconnect somewhere. The linked answer says:

> Now, git is different. Git stores references to complete blobs and this means that with git, only one commit is sufficient to recreate the codebase at that point in time. Git does not need to look up information from past revisions to create a snapshot.

> So if that is the case, then where does the delta compression that git uses come in?

> Well, it is nothing but a compression concept - there is no point storing the same information twice, if only a tiny amount has changed. Therefore, represent what has changed, but store a reference to it, so that the commit that it belongs to, which is in effect a tree of references, can still be re-created without looking at past commits.

You can recreate a file that is stored as a root blob plus some series of diffs without looking at information from past commits. But you can't recreate it without doing the diffs! You have to look at the root blob. This is, internally, tracked separately from the commit which created it. But your conclusion:

> when you checkout stuff, git doesn't do diffs to give you the working directory at that point.

cannot be true. If the working directory at that point corresponds to a blob which has only diff information stored, git must apply that diff to a separate blob in order to give you the working directory.
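A toy illustration of that step in Python. It is only loosely modeled on the copy/insert idea behind pack deltas; the real packfile encoding is a compact binary format, and `make_delta` and `apply_delta` are hypothetical names. A delta here is a list of "copy this range from the base" and "insert these literal bytes" instructions, and reading the object back means applying them to the base.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # bytes not found in base
        # a pure "delete" contributes nothing to the target
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full target contents from the base plus the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"alpha\n   \nbeta\ngamma\n"
target = b"alpha\n\nbeta\ngamma\n"  # whitespace stripped from the blank line

delta = make_delta(base, target)
rebuilt = apply_delta(base, delta)
```

`rebuilt` is byte-for-byte the full file, so whoever asks for the object cannot tell whether it was stored whole or as a delta, but the delta does have to be applied to produce it.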

Which is explicitly a lower-level optimization applied to files well-suited for it and not related to the concept of a commit. A commit does not reference a diff.
Fair enough. The blob stores a diff. The commit stores a reference to... the diff. This is a division between the concept of the object and the implementation. But it's not an example of a diff-storing model failing to model git as it is; git as it is is storing diffs.

If a commit references a "blob", and the "blob" that it references is, in fact, a diff, why would we say that the commit "does not reference a diff"?

If it didn't, then the diff mechanism git would end up using would be purely internal to git and abstracted away for the user. Why? Because it would mean your diff algorithm is now your storage format and it is not allowed to change, ever. That's going to cause worse issues than large git repositories.

There is also the obvious performance problem that you would have to replay all the diffs to switch between commits.

It does, yes. To add a bit more color though, what happens is that when you run `git gc` (or it's run automatically for you sometimes) an extra compression step is done that uses diffs of some sort to avoid storing so many near copies. Packfiles are related to this.
git works off snapshots, which are blobs. Blobs are compressed very effectively, but if your file takes 30KB compressed then yes, another 30KB compressed is added for every whitespace change[0].

Another way to think of it: if commits were diffs and you had 1000 commits, getting to the 'head' would take forever, because git would have to replay every commit's diff just to get there.

Yes, you could combine diffs & snapshots, but that in itself is a tricky complexity in an already very complex system.

[0] https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...

Oh my yes. Some years ago, I did not realize this. My mental model of Git had it de-duplicating common text between commits, but this is not the case. I learned the truth the hard way when I wrote a commit hook that automatically appended about a hundred lines to a text file with every commit. It worked fine at first, but eventually `git fetch` started failing.
A diff is something that describes the changes between two versions but does NOT refer to any specific version; or at least it is something that can be worked with independent of any specific version.

I.e., I can develop a fix for an issue in version 1.2.30 of some software, generate the diff using the diff tool, and then apply this diff using the patch tool to version 1.1.15 of the software. This might fail (or result in something undesired), but there is no problem in principle with moving the diff around and applying it somewhere else.

A git commit however is a particular version, so git is not really good at applying a commit somewhere else.