
by xxbondsxx 4869 days ago

Thanks a ton for catching this. I guess there is a distinction to be made -- the compression might use deltas, but a commit specifies the entire state of the repository.

It's a tricky line to walk though, because commands like "git show" and "git format-patch" clearly show the delta-like nature of a single commit. I also don't want newcomers to think that commits are heavy and should be used sparingly.

I'm totally down to discuss this on a github issue with you, we could go over the wording. Maybe something like "a commit specifies the entire state of a repository, but is usually stored on disk as a set of changes"?

EDIT: moving discussion to: https://github.com/pcottle/learnGitBranching/issues/6

EDIT: fixed in: https://github.com/pcottle/learnGitBranching/commit/168852b2...


Cogito 4869 days ago

This is still not quite correct. Let me outline the structure, and then I will try and submit a PR addressing it.

The first step to committing is staging what should be included.

The staging process specifies an index of files that are to be added to the next commit. When the commit is recorded, git checks every file/chunk in the index: it calculates a hash for each of these blobs, stores each blob under its hash in a key<->object store (the object store), and writes the hashes into the index of the commit.

If a blob already exists in the store then it is not added again.

When changes are made and committed after this point, the resulting blobs are then hashed and stored again. Any unchanged blob does not need to be stored again; any changed blobs are stored.
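To make the "stored once" behaviour concrete, here is a minimal sketch of a content-addressed store. The `blob_id` function follows the way git derives a blob's ID in SHA-1 repositories (a `blob <size>` header, a NUL byte, then the content). The `ObjectStore` class is a hypothetical in-memory stand-in, not git's real storage code.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 ID git gives a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # setdefault leaves an existing entry alone, so identical
        # content is only ever kept once.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
first = store.put(b"hello world\n")
second = store.put(b"hello world\n")  # same content, same ID, nothing new stored
third = store.put(b"hello again\n")   # changed content gets its own entry
print(first == second, len(store.objects))
```

Because the key is derived from the content, "does this blob already exist?" is just a dictionary lookup in this sketch.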

When a commit is recreated, its index is evaluated. Each blob is retrieved from the object store and placed into the tree in the appropriate location.

The most important thing to take away from this? Rebuilding a tree from blobs is fast. The second thing? Git only stores each version of a blob (be it a file or a chunk) once, so most 'unpacked' repositories are still quite small.
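A rough sketch of why rebuilding is cheap, under simplifying assumptions: each commit here is a flat path-to-blob-ID mapping (real git uses nested tree objects), and the IDs are made-up placeholder strings rather than real hashes.

```python
# Toy object store: placeholder IDs mapped to file contents.
objects = {
    "id-readme-v1": b"Hello\n",
    "id-readme-v2": b"Hello, world\n",
    "id-main": b"print('hi')\n",
}

# Each commit lists every file in the snapshot, not just what changed.
commits = {
    "c1": {"README": "id-readme-v1", "main.py": "id-main"},
    "c2": {"README": "id-readme-v2", "main.py": "id-main"},  # main.py blob reused
}


def checkout(commit_name):
    """Rebuild the tree for a commit with one store lookup per file."""
    snapshot = commits[commit_name]
    return {path: objects[oid] for path, oid in snapshot.items()}


tree_c1 = checkout("c1")
tree_c2 = checkout("c2")
print(sorted(tree_c2))
```

Two commits reference three blobs in total, and checking out either one never has to replay the other.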

Now, this is obviously not the smallest representation of the repository, so git has a packed format which calculates deltas between blob files. This will calculate blob-deltas even if they are completely separated in the history; deltas are not between commits, instead they are between objects. Unpacking deltas recreates the blob objects required to build the tree.
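As an illustration of a delta between two objects (rather than between two commits), here is a toy copy/insert encoding built on Python's `difflib`. This is not git's pack format, which is a compact binary encoding; `make_delta` and `apply_delta` are hypothetical helpers that only show the idea of rebuilding one blob from another.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Encode target as copy-from-base and insert-literal instructions."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are just not copied.
    return delta


def apply_delta(base: bytes, delta) -> bytes:
    """Recreate the target blob from the base blob and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
restored = apply_delta(base, delta)
print(restored == target)
```

Nothing in the sketch knows which commit either blob came from; the delta only relates two pieces of content.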

The packing process happens every now and then, but it is definitely not done every time a commit is made (by default). The most visible place it is used is when transferring over network protocols (I can't recall if it is done for every network transfer, but I suspect it is). It is done when running garbage collection as well.

----

The reason why all this is important is as I laid out before: rebuilding trees is fast, which makes fast branching possible, and the object store allows this without exploding the size of the repository.


xxbondsxx 4869 days ago

I'll be honest -- that was hard to understand. I don't know the low-level plumbing of git very well, which is why I made a higher level tool like this.

Do you think it's important for beginners to understand all these subtleties? I think I could maybe eventually introduce them, but for the first level on the first screen, I don't think throwing a bunch of concepts at them will help with learning. Feel free to re-open the task if you disagree.


Cogito 4869 days ago

I'll admit that I don't know what your knowledge is, so my explanation might have been directed at the wrong level. I am drafting a pull request that rewrites this for you, hopefully that will be easier to understand!

I don't think that you need to understand all the plumbing, however it is important to understand how the index works, and that git stores the entire snapshot. The way it currently appears makes it seem like every time you switch branches git has to figure out the end state based on deltas, which is not true. Git has fast branch switching precisely because it stores the snapshot in its entirety.

[EDIT] Here is what I wrote if anyone is interested: http://gist.io/4969804


xnxn 4869 days ago

Thanks for explaining this. Now I understand what the

    index 1ef56e5..2c756d0

in a diff refers to. I'm now poking around in the object store with `git cat-file -p <hash>` and it's very enlightening.
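For anyone curious what sits behind those hashes on disk, here is a small sketch of the loose object layout in a SHA-1 repository: the header and content are zlib-compressed and stored under a directory named after the first two hex characters of the ID. The temporary directory stands in for `.git/objects`, and both helper functions are hypothetical names for illustration, not part of git.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_blob(objects_dir, content):
    """Write a blob as header + NUL + content, zlib-compressed, keyed by its SHA-1."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(raw))
    return oid


def read_loose_object(objects_dir, oid):
    """Return (type, content) for a loose object, roughly what cat-file shows."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content


objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
oid = write_loose_blob(objects_dir, b"hello world\n")
obj_type, content = read_loose_object(objects_dir, oid)
print(obj_type, oid[:7])
```

The seven-character prefix printed at the end is the same kind of abbreviated blob ID that appears on the `index` line of a diff.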


archgoon 4869 days ago

That sounds reasonable, looks like someone has already opened an issue. :)
