by matthiasv 2044 days ago
While this sounds all nice it actually fails to model Git as it is. Git is an object database and its objects are blobs, trees and commits not diffs, so your premise is based on a misconception.
3 comments

Git itself requires you to be able to think in both models, diffs and snapshots. For example, most uses of `git rebase` are clearer if your mental model while doing so is of diffs.

That only one of them is how it's implemented is beside the point really, until you get _quite_ low level.

Of course when working with Git it makes sense to think in changesets. But OP was specifically modelling the technical side starting with "We have files, which are inert objects …".
And that's a bigger problem with git than the horrible commands. Every project I've worked on that has used git has had a diff-based workflow, and sometimes the mismatch is painful — I wish I didn't have to know about `--full-history` and that `git show X` isn't necessarily the same as `git diff X^..X`.

I really hope something like pijul takes off.

> Git is an object database and its objects are blobs, trees and commits not diffs

What do you think is the difference between a "commit" and a "diff"?

A "commit" doesn't contain a diff, it contains (references to) the blobs of the files at that state. Diffs are display-only, generated by comparing two full file states.
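
To make that concrete, here is a toy sketch in Python, not Git's real formats. The names `Store`, `commit` and `show_diff` are made up for illustration, and a tree is flattened to a dict of path to blob id. The only point is that the commit records full file states and the diff is computed when somebody asks for it.

```python
import difflib
import hashlib


class Store:
    """Toy content-addressed store (hypothetical, not Git's on-disk format)."""

    def __init__(self):
        self.objects = {}   # blob id -> full file contents as bytes

    def put_blob(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data
        return oid

    def commit(self, files: dict, parent=None) -> dict:
        # A commit holds references to complete file contents, not changes.
        tree = {path: self.put_blob(data) for path, data in files.items()}
        return {"tree": tree, "parent": parent}

    def show_diff(self, old: dict, new: dict, path: str) -> list:
        # The diff is derived on demand by comparing two full snapshots.
        a = self.objects[old["tree"][path]].decode().splitlines()
        b = self.objects[new["tree"][path]].decode().splitlines()
        return list(difflib.unified_diff(a, b, "a/" + path, "b/" + path, lineterm=""))


store = Store()
c1 = store.commit({"README": b"hello\nworld\n"})
c2 = store.commit({"README": b"hello\nthere\n"}, parent=c1)
diff_lines = store.show_diff(c1, c2, "README")
print("\n".join(diff_lines))
```

Nothing resembling `diff_lines` is stored anywhere in `store.objects`; both versions of `README` are there in full.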
You really believe that git stores -- in full -- every version of a tracked file? Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space gone?
> You really believe that git stores -- in full -- every version of a tracked file?

Yes, it does.

> Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space

Yes, it is.

"It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap." -- https://lwn.net/Articles/131657/

One of the insights of the git design was that, nowadays, disk space is cheap. The first releases of git always stored each object separately in its own file in the object database. Git still does so nowadays, but once the number of files gets over a certain threshold, newer releases of git run an "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size. But that's just a physical storage optimization; in the logical model, whenever you ask for an object, you always get its full contents, not a delta against some other object.
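
A rough sketch of that separation between physical storage and logical model, in Python. Everything here is an assumption made for illustration: `PackedStore` is a made-up name, and the delta is line-based via `difflib`, whereas real packfiles use a byte-level copy/insert encoding. What it shows is that `get()` hands back the full contents regardless of how the object is stored.

```python
import difflib


class PackedStore:
    """Toy store: an object is kept whole or as a delta against a base."""

    def __init__(self):
        self._full = {}    # name -> list of lines stored whole
        self._delta = {}   # name -> (base name, list of copy/insert ops)

    def add_full(self, name, lines):
        self._full[name] = list(lines)

    def add_as_delta(self, name, base, lines):
        base_lines = self.get(base)
        matcher = difflib.SequenceMatcher(a=base_lines, b=lines, autojunk=False)
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))          # reuse lines from the base
            else:
                ops.append(("insert", lines[j1:j2]))  # empty for a pure deletion
        self._delta[name] = (base, ops)

    def get(self, name):
        # Callers always receive full contents, whatever the physical layout.
        if name in self._full:
            return list(self._full[name])
        base, ops = self._delta[name]
        base_lines = self.get(base)
        out = []
        for op in ops:
            if op[0] == "copy":
                out.extend(base_lines[op[1]:op[2]])
            else:
                out.extend(op[1])
        return out


store = PackedStore()
v1 = ["line %d" % i for i in range(100)]
v2 = v1[:50] + ["a new line"] + v1[50:]
store.add_full("v1", v1)
store.add_as_delta("v2", "v1", v2)
restored = store.get("v2")
```

The caller of `get("v2")` cannot tell that `v2` was stored as a delta, which is the sense in which the delta is only a storage optimization.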

> "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size

Isn't it the case then that git doesn't store in full every version of a tracked file?

Deduplication and compression do not imply diffs.
The git book explains the internals very well, so you can easily verify it for yourself. Files are referenced as objects in trees, which are pointed to by commits. Editing a file creates a new object for it. (edited for tone)
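
One way to check the "new object" part without touching a repository: a small Python sketch of how a blob id is derived in a SHA-1 repository, namely the hash of a `blob <size>\0` header followed by the contents. The function name `git_blob_id` is made up here. Any change to the contents, even one space, gives a different id and therefore a different object.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id a SHA-1 Git repository assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


v1 = b"hello world\n"
v2 = b"hello world \n"   # same file with one trailing space added

id1 = git_blob_id(v1)
id2 = git_blob_id(v2)
print(id1)   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(id2)
```

The first id should match what `git hash-object` reports for a file containing `hello world` and a newline.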
From the book:

> You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

> It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.

> When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space.

https://git-scm.com/book/en/v2/Git-Internals-Packfiles

That's just for compression. Commits aren't diffs, and when you check out a commit, git doesn't apply diffs to give you the working directory at that point. See https://stackoverflow.com/a/25028688/8272371 for a detailed explanation.
Which is explicitly a lower-level optimization applied to files well-suited for it and not related to the concept of a commit. A commit does not reference a diff.
If it didn't, then the diff mechanism git would end up using would be purely internal to git and abstracted away for the user. Why? Because it would mean your diff algorithm is now your storage format and it is not allowed to change, ever. That's going to cause worse issues than large git repositories.

There is also the obvious performance problem that you would have to replay all the diffs to switch between commits.
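
A back-of-the-envelope sketch of that cost in Python, under a deliberately naive assumption: a hypothetical pure-diff store with no intermediate snapshots, where each "diff" is just a function from one version to the next. The function names are made up for illustration.

```python
def checkout_snapshot(snapshots, n):
    """Snapshot model: version n is stored whole, so one lookup is enough."""
    return snapshots[n], 1


def checkout_by_replay(initial, diffs, n):
    """Pure-diff model: rebuild version n by applying the first n diffs in order."""
    lines = list(initial)
    for apply_diff in diffs[:n]:
        lines = apply_diff(lines)
    return lines, n


# Hypothetical history: each of 1000 commits appends one line.
initial = ["line 0"]
diffs = [lambda lines, i=i: lines + ["line %d" % i] for i in range(1, 1001)]

snapshots = [initial]
for apply_diff in diffs:
    snapshots.append(apply_diff(snapshots[-1]))

head_a, steps_a = checkout_snapshot(snapshots, 1000)
head_b, steps_b = checkout_by_replay(initial, diffs, 1000)
print(steps_a, steps_b)   # prints: 1 1000
```

Both calls return the same contents; only the amount of work to get there differs, and it grows with the length of the history in the replay case.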

It does, yes. To add a bit more color though, what happens is that when you run `git gc` (or it's run automatically for you sometimes) an extra compression step is done that uses diffs of some sort to avoid storing so many near copies. Packfiles are related to this.
Git works off snapshots, which are stored as blobs, and blobs are compressed very effectively, but if your file takes 30KB compressed then yes, another 30KB compressed is added for every whitespace change[0].

Another way to think of it: if it were diffs and you had 1000 commits, getting to the 'head' would take forever, because git would have to replay every commit's diff just to get there.

Yes, you could combine diffs & snapshots, but that in itself is a tricky complexity in an already very complex system.

[0] https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...

Oh my yes. Some years ago, I did not realize this. My mental model of Git had it de-duplicating common text between commits, but this is not the case. I learned the truth the hard way when I wrote a commit hook that automatically appended about a hundred lines to a text file with every commit. It worked fine at first, but eventually `git fetch` started failing.
A diff is something that describes the changes between two versions but does NOT refer to any specific version; or at least it is something that can be worked with independent of any specific version.

I.e., I can develop a fix for an issue in version 1.2.30 of some software, generate the diff using the diff tool, and then apply this diff using the patch tool to version 1.1.15 of the software. This might fail (or result in something undesired), but there is no problem in principle with moving the diff around and applying it somewhere else.

A git commit however is a particular version, so git is not really good at applying a commit somewhere else.
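
A toy illustration of that portability in Python. It is only a sketch: `make_hunk` and `apply_hunk` are made-up names, the "diff" is a single hunk located purely by matching its old lines, and the real `patch` tool is considerably more forgiving about where and how a hunk applies. The file contents are invented; the version numbers are the ones from the example above.

```python
def make_hunk(before, after):
    """A diff reduced to one hunk: the lines to find and their replacement."""
    return {"find": list(before), "replace": list(after)}


def apply_hunk(lines, hunk):
    """Apply the hunk at the first place its 'find' lines occur, else raise."""
    find = hunk["find"]
    n = len(find)
    for i in range(len(lines) - n + 1):
        if lines[i:i + n] == find:
            return lines[:i] + hunk["replace"] + lines[i + n:]
    raise ValueError("hunk does not apply: context not found")


# The fix is developed against 1.2.30 ...
v1_2_30 = ["header v1.2.30", "total = a + b", "print(total)", "footer"]
fix = make_hunk(["total = a + b"], ["total = a + b + c"])
patched_new = apply_hunk(v1_2_30, fix)

# ... and the same hunk is then applied to 1.1.15, which looks different elsewhere.
v1_1_15 = ["header v1.1.15", "setup()", "total = a + b", "print(total)"]
patched_old = apply_hunk(v1_1_15, fix)
print(patched_old)
```

The hunk never names a version, so it applies wherever its old lines are found, and it fails with an error where they are not.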