by bradfitz 330 days ago
My recent horror from some git work was discovering how git sorts its tree objects.

The docs just say to sort by C locale (byte-order sorting). Easy. Except git was sometimes rejecting my packfiles as being bogus per its fsck code, saying my trees were misordered.

TURNS OUT THERE'S AN UNDOCUMENTED RULE: you need to append an implicit forward slash to directory tree entry names before you sort them.

That forward slash is not encoded in the tree object, nor is the type of the entry. You just put the 20-byte SHA-1 hash, which points to either a blob or a tree (or a commit for submodules).

So a directory containing a subdirectory "testing" and a file "testing.md" sorts differently from a directory containing two files named "testing" and "testing.md".
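
A minimal sketch of that rule, assuming the caller already knows which entries are directories. The names `tree_sort_key`, `sort_entries` and the `is_tree` flag are made up for illustration and are not taken from git's source:

```python
def tree_sort_key(name: bytes, is_tree: bool) -> bytes:
    """Sort key for one tree entry: directories compare as if they ended in '/'."""
    return name + b"/" if is_tree else name


def sort_entries(entries):
    """Sort (name, is_tree) pairs bytewise using the implicit-slash rule."""
    return sorted(entries, key=lambda e: tree_sort_key(e[0], e[1]))


# "testing" is a directory here, so it is compared as b"testing/".
with_dir = sort_entries([(b"testing", True), (b"testing.md", False)])
# Both entries are plain files, so the names are compared as-is.
files_only = sort_entries([(b"testing", False), (b"testing.md", False)])

print([name for name, _ in with_dir])    # [b'testing.md', b'testing']
print([name for name, _ in files_only])  # [b'testing', b'testing.md']
```

The slash is only part of the sort key; the names returned are unchanged, which matches the point that the slash never appears in the stored entry.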

You can see a repro at https://gist.github.com/bradfitz/4751c58b07b57ff303cbfec3e39...

(So to verify that a tree object is ordered correctly, you need to have the objects for all the entries in the tree, at least one level down.)

2 comments

I've had this exact bug happen to me when I implemented my git clone.

The way I found out was that GitHub kept rejecting my push. As I later discovered, my git history was invalid precisely because entries were sorted improperly with respect to the forward slash requirement. I could have fixed this with the real git, but the point was to use my tool exclusively for version control from inception, so I just deleted the .git folder. As a result, my git history appears to begin near the end of the whole cycle. But I did manage to learn a lot, both about git and about the language I implemented it in.

> directory tree entry names

But... git doesn't really store directories, does it?

I wrote a longer comment saying this (deleted now since I was wrong).

Turns out that Git does somewhat store dirs (in the form of trees). See https://git-scm.com/book/en/v2/Git-Internals-Git-Objects (section "Tree Objects").

To understand op's repro look at the last two lines (objects in the tree) in each of their command outputs, not the files shown in the first few lines.

What I think op means is that the `testing` tree pointed to in their first example is sorted after `testing.md`, even though it's only called `testing`, because it's being sorted as `testing/` and `/` is > `.` bytewise.

I'm not at a computer right now but it would be nice to test it with files named `testing.` and `testing0`, since `.` and `0` sit immediately on either side of `/` bytewise. That would show the implicit forward slash more clearly, with the tree object sitting between them.
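
The suggested experiment can be approximated without git. This is a rough sketch in Python, where `sort_key` is a hypothetical helper that appends the implicit slash for directories:

```python
def sort_key(name: bytes, is_tree: bool) -> bytes:
    """Directories are compared as if their name ended in '/'."""
    return name + b"/" if is_tree else name


# '.', '/' and '0' are consecutive byte values.
print(hex(ord(".")), hex(ord("/")), hex(ord("0")))  # 0x2e 0x2f 0x30

# Two files and one directory; only "testing" is a tree.
entries = [(b"testing0", False), (b"testing", True), (b"testing.", False)]

with_slash_rule = [n for n, _ in sorted(entries, key=lambda e: sort_key(*e))]
plain_bytewise = sorted(n for n, _ in entries)

print(with_slash_rule)  # [b'testing.', b'testing', b'testing0']
print(plain_bytewise)   # [b'testing', b'testing.', b'testing0']
```

With the slash rule the directory lands between the two files; a plain bytewise sort of the stored names would put it first.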

This makes me wonder why Git can't just store an empty tree for empty dirs.

EDIT: did the Gist https://gist.github.com/alvaro-cuesta/bd0234e3e1a66819c7e9e9...

Notice the `git cat-file -p HEAD^{tree}` outputs.

> This makes me wonder why Git can't just store an empty tree for empty dirs.

tl;dr: it can (see my other comment) and the empty tree is hardcoded. But since the index works with file paths and blobs, having no file means that there's no entry in the index.

Yes it does, it just doesn't store empty directories.
It can store empty directories (actually, trees). It normally doesn't because the index maps paths to blobs: an empty directory has no file to map to a blob, so `git add` has no effect on it. Since commits are normally written from the index contents, you won't normally find an empty tree.
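
A toy model of why that happens, not git's actual implementation: if the index is treated as a flat list of file paths, the only directories that can be derived are the parents of some file. The function name `trees_implied_by_index` and the sample paths are made up for illustration:

```python
def trees_implied_by_index(paths):
    """Return the directory paths that a flat list of file paths implies.

    "" stands for the root tree, which always exists.
    """
    trees = {""}
    for path in paths:
        parents = path.split("/")[:-1]  # drop the file name
        for depth in range(1, len(parents) + 1):
            trees.add("/".join(parents[:depth]))
    return trees


# Hypothetical index contents: only files have entries.
index_paths = ["README.md", "src/main.c", "docs/api/index.md"]
print(sorted(trees_implied_by_index(index_paths)))

# An empty index still implies the root tree, which then has no entries.
print(sorted(trees_implied_by_index([])))
```

A directory with no files under it never shows up in the result, because no path in the list mentions it.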

You can run `git commit --allow-empty` with an empty index and the root tree will be the empty tree:

   $ git init
   $ git commit --allow-empty -m foo
   $ git rev-parse @^{tree}
   4b825dc642cb6eb9a060e54bf8d69288fbee4904
4b825dc is the empty tree. A funny thing about it is that it is hardcoded in Git, so you can use it even when the object does not exist in the repository:

   $ git init
   $ git commit-tree -m foo 4b825dc642cb6eb9a060e54bf8d69288fbee4904
   $ tree .git/objects # you'll see that there's no file for the empty tree
This is a good read about that weird object: https://matheustavares.dev/posts/empty-tree
You can quite easily put the empty tree object in as a child of another tree object. It just isn't supported, and some parts of Git will break.
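
The ID itself is not magic. As a small sketch, assuming a SHA-1 repository, it is the hash of the header `tree 0` followed by a NUL byte and an empty body. `git_object_id` is a helper name invented here:

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way SHA-1 git repositories do: '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries has an empty body.
empty_tree_id = git_object_id("tree", b"")
print(empty_tree_id)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the input is fixed, every SHA-1 repository agrees on this ID, which is presumably what makes it practical to hardcode.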