Hacker News
by archgoon 4869 days ago
This tutorial is great, but it propagates a misconception about git.

"A commit in git is a recorded set of changes that you have made"

Git commits are _not_ deltas. They are entire snapshots of the repository and a single (optional) pointer to an ancestor commit[1]. Git may handle _compression_ in terms of deltas (see 'Packfiles' in [2]), but logically, a commit should be thought of as equivalent to the state of all files that are being tracked. The difference is that if you were only looking at diffs, commits would be the _edges_ of a graph, rather than a node plus a single edge. This is why rebases change commit SHAs but merges do not (and why merges create a new commit). This is also why, if you are on a merge branch, 'git checkout HEAD~3' may not bring you to where 'git log' would naively suggest.
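To make the node-versus-edge point concrete, here is a minimal sketch in Python. It is a toy model, not git's real commit format (real commits also record author, committer and timestamps, and point at a tree object rather than embedding file contents); the names `commit_id` and `diff` are made up for illustration.

```python
import hashlib
import json


def commit_id(snapshot, parents, message):
    """Hash a toy commit: the full snapshot, the parent ids and the message.

    This is NOT git's on-disk format, only an illustration of the idea that
    the id covers the whole state plus the pointer(s) to the ancestors.
    """
    payload = json.dumps(
        {"snapshot": snapshot, "parents": parents, "message": message},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode()).hexdigest()


def diff(old, new):
    """Derive a delta by comparing two snapshots; it is computed, not stored."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


base = {"a.txt": "one"}
snap = {"a.txt": "one", "b.txt": "two"}

root = commit_id(base, [], "root")
other_root = commit_id({"a.txt": "uno"}, [], "other root")

# Same snapshot and message, different parent: the id changes, which is
# roughly what happens to commits that get rebased onto a new base.
original = commit_id(snap, [root], "add b")
rebased = commit_id(snap, [other_root], "add b")

changed_paths = diff(base, snap)
```

In this model the "delta" of a commit only exists when you compare it with its parent, while the id depends on the parent pointer, so moving a commit onto a different parent always produces a new id.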

Version control systems that actually do think of 'commits' as pure 'deltas' are ones such as darcs.

A really good, low-level explanation of git is here:

http://git-scm.com/book/en/Git-Internals-Git-Objects

(BUG REPORT) The commit created with 'git merge b2' from branch b1 should have HEAD~1 point to the previous head of b1, not b2.

(that said, this is a really cool thing. :) I look forward to the author adding support for conflict resolution. )

[1] http://git-scm.com/book/en/Git-Internals-Git-Objects#Commit-...

[2] http://git-scm.com/book/en/Git-Internals-Packfiles

2 comments

Thanks a ton for catching this. I guess there is a distinction to be made -- the compression might use deltas, but a commit specifies the entire state of the repository.

It's a tricky line to walk though, because commands like "git show" and "git patch" clearly show the delta-like nature of a single commit. I also don't want newcomers to think that commits are heavy and should be used sparingly.

I'm totally down to discuss this on a GitHub issue with you; we could go over the wording. Maybe something like "a commit specifies the entire state of a repository, but is usually stored on disk as a set of changes"?

EDIT: moving discussion to: https://github.com/pcottle/learnGitBranching/issues/6

EDIT: fixed in: https://github.com/pcottle/learnGitBranching/commit/168852b2...

This is still not quite correct. Let me outline the structure, and then I will try and submit a PR addressing it.

The first step to committing is staging what should be included.

The staging process specifies an index of files that are to be added to the next commit. When the commit is recorded, git checks every file/chunk in the index: a hash is calculated for each of these blobs, each blob is stored under its hash in a key<->object store (the object store), and the hashes are written into the index of the commit.

If a blob already exists in the store then it is not added again.

When changes are made and committed after this point, the resulting blobs are then hashed and stored again. Any unchanged blob does not need to be stored again; any changed blobs are stored.
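The store-once behaviour can be sketched in a few lines of Python. The blob hash below follows git's documented scheme for SHA-1 repositories (hash of `blob <size>\0` followed by the content); the `ObjectStore` class and the two index dictionaries are hypothetical stand-ins, not git's actual data structures.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Id git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store: the key is the hash of the content."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        key = git_blob_hash(content)
        # A blob that is already present is not stored a second time.
        self.objects.setdefault(key, content)
        return key


store = ObjectStore()

# First commit: two files with identical content share one blob.
index_v1 = {"README": store.put(b"hello\n"), "COPY": store.put(b"hello\n")}

# Second commit: only COPY changed, so only one new blob is added.
index_v2 = {"README": store.put(b"hello\n"), "COPY": store.put(b"hello, world\n")}
```

After both commits the store holds two blobs in total, even though the two indexes together mention four paths.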

When a commit is recreated, its index is evaluated. Each blob is retrieved from the object store and placed into the tree in the appropriate location.
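As a rough sketch of that step, assuming the index is just a mapping from path to blob key and the object store is a plain dictionary (both are simplifications, and the keys `k1` and `k2` are placeholders rather than real hashes):

```python
import tempfile
from pathlib import Path


def checkout(index, objects, dest):
    """Write every blob named in the index to its path under dest."""
    for rel_path, key in index.items():
        target = Path(dest) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(objects[key])


# Hypothetical store and index, for illustration only.
objects = {"k1": b"print('hi')\n", "k2": b"docs\n"}
index = {"src/main.py": "k1", "README": "k2"}

with tempfile.TemporaryDirectory() as workdir:
    checkout(index, objects, workdir)
    # Read the files back to confirm the tree matches the snapshot.
    restored = {
        p.relative_to(workdir).as_posix(): p.read_bytes()
        for p in Path(workdir).rglob("*")
        if p.is_file()
    }
```

Nothing here looks at earlier commits: the snapshot alone is enough to rebuild the working tree.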

The most important thing to take out of this? Rebuilding a tree from blobs is fast. Second thing to take out? Git only stores each version of a blob (be it a file or a chunk) once, so most 'unpacked' repositories are still quite small.

Now, this is obviously not the smallest representation of the repository, so git has a packed format which calculates deltas between blob files. This will calculate blob-deltas even if they are completely separated in the history; deltas are not between commits, instead they are between objects. Unpacking deltas recreates the blob objects required to build the tree.
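Here is a small sketch of the idea of a delta between two objects, expressed as "copy this range from the base" and "insert these literal bytes" instructions. It uses Python's `difflib` to find the matching ranges, which is an assumption made for brevity: git's packfile code has its own matching algorithm and binary encoding, and `make_delta` / `apply_delta` are made-up names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy/insert instructions against base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal bytes
        # "delete" needs no instruction: those base bytes are simply skipped.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target object from the base object and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d\n" % n for n in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(base, delta)
```

The two blobs do not need to be related in history for this to work; any object that happens to be similar can serve as the base, and applying the delta gives back the full blob.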

The packing process happens every now and then, but it is definitely not done every time a commit is made (by default). The most visible place it is used is when transferring over network protocols (I can't recall if it is done for every network transfer, but I suspect it is). It is done when running garbage collection as well.

----

The reason why all this is important is as I laid out before: rebuilding trees is fast, which makes fast branching possible, and the object store allows this without exploding the size of the repository.

I'll be honest -- that was hard to understand. I don't know the low-level plumbing of git very well, which is why I made a higher level tool like this.

Do you think it's important for beginners to understand all these subtleties? I think I could maybe eventually introduce them, but for the first level on the first screen, I don't think throwing a bunch of concepts at them will help with learning. Feel free to re-open the task if you disagree.

I'll admit that I don't know what your knowledge is, so my explanation might have been directed at the wrong level. I am drafting a pull request that rewrites this for you; hopefully that will be easier to understand!

I don't think that you need to understand all the plumbing; however, it is important to understand how the index works, and that git stores the entire snapshot. The way it currently appears makes it seem like every time you switch branches git has to figure out the end state based on deltas, which is not true. Git has fast branch switching precisely because it stores the snapshot in its entirety.
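A toy comparison of the two designs, assuming a linear history of 100 commits in which each commit adds one file. Everything here (the `commits` and `branches` dictionaries, `switch`, `switch_by_replay`) is hypothetical and only meant to show the shape of the work involved, not how any real tool is implemented.

```python
commits = {}   # commit id -> {"parent": id or None, "snapshot": {path: content}}
snapshot = {}
parent = None
for n in range(1, 101):
    snapshot = {**snapshot, f"file{n}.txt": f"v{n}"}
    commits[f"c{n}"] = {"parent": parent, "snapshot": snapshot}
    parent = f"c{n}"

# A branch is just a name pointing at a commit.
branches = {"main": "c100", "old": "c3"}


def switch(branch):
    """Snapshot model: one lookup, however long the history is."""
    return commits[branches[branch]]["snapshot"]


def switch_by_replay(branch):
    """Delta-only model: walk back to the root, then reapply every change.

    This toy replay only handles added or modified files, not deletions.
    """
    chain = []
    cid = branches[branch]
    while cid is not None:
        chain.append(cid)
        cid = commits[cid]["parent"]
    tree, steps = {}, 0
    for cid in reversed(chain):
        parent_id = commits[cid]["parent"]
        before = commits[parent_id]["snapshot"] if parent_id else {}
        after = commits[cid]["snapshot"]
        tree.update({p: v for p, v in after.items() if before.get(p) != v})
        steps += 1
    return tree, steps


replayed_tree, replay_steps = switch_by_replay("main")
```

Both functions end up with the same tree, but the replay version does work proportional to the length of the history while the snapshot version does not.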

[EDIT] Here is what I wrote if anyone is interested: http://gist.io/4969804

Thanks for explaining this. Now I understand what the

    index 1ef56e5..2c756d0
in a diff refers to. I'm now poking around in the object store with `git cat-file -p <hash>` and it's very enlightening.
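For anyone who wants to see what is behind those hashes, here is a minimal sketch of the loose-object layout in a SHA-1 repository: the bytes `<type> <size>\0<content>`, zlib-compressed, stored under a name derived from their SHA-1. The function names are made up, and the sketch works in memory rather than reading a real `.git/objects` directory.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, content: bytes):
    """Return (object id, compressed bytes) for a loose object."""
    raw = obj_type.encode() + b" " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_loose_object(compressed: bytes):
    """Inverse of the above: recover the type and the content."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), content


object_id, stored = encode_loose_object("blob", b"hello\n")
short_id = object_id[:7]          # the abbreviated form seen on "index" lines
obj_type, content = decode_loose_object(stored)
```

The two abbreviated ids on an `index` line in a diff are the blob ids of the file before and after the change.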

That sounds reasonable, looks like someone has already opened an issue. :)

"They are entire snapshots of the repository and a single (optional) pointer to an ancestor commit"

Nit: A commit can have multiple ancestors, as in a merge.
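A small sketch of what that means for `~` and `^`, using a hypothetical four-commit history where `M` is the result of running 'git merge b2' while on b1. The dictionary and helper names are made up; the behaviour they imitate (`~n` follows first parents, `^n` picks the n-th parent) is git's documented revision syntax.

```python
# commit -> ordered list of parents (the first parent comes first)
commits = {
    "A": [],           # root
    "B": ["A"],        # previous head of b1
    "C": ["A"],        # head of b2
    "M": ["B", "C"],   # merge of b2 into b1
}


def ancestor(commit, n):
    """Like commit~n: follow the first parent n times."""
    for _ in range(n):
        commit = commits[commit][0]
    return commit


def nth_parent(commit, n):
    """Like commit^n: the n-th parent of a commit (1-based)."""
    return commits[commit][n - 1]


first = ancestor("M", 1)      # M~1
second = nth_parent("M", 2)   # M^2
root = ancestor("M", 2)       # M~2
```

Here `M~1` is `B`, the previous head of b1, and `C` is only reachable as `M^2`, which is why the order of the parents matters for the bug report above.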