Hacker News
by frutiger 439 days ago
Your description (including the detailed description in the reply) seems to be missing the crucial property of git's design: the hash of an object is not some GUID, it is literally the hash of the object's content. This makes a big difference, as you don't need a central registry that maps the GUID to the object.
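To make "the hash of the content" concrete, here is a minimal sketch of how a blob ID can be computed without any repository at all. It assumes git's classic SHA-1 object format (repositories created with the SHA-256 object format hash differently), and the function name is made up for illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content.

    Assumes the classic SHA-1 object format: the hashed data is the
    header b"blob <size>\\0" followed by the raw file bytes.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Same bytes in, same ID out, on any machine and in any repository.
print(git_blob_hash(b"hello world\n"))
```

The result should agree with what `git hash-object` reports for a file with the same bytes, which is why no registry is needed to map an ID to its content.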

Every git repo has a copy of that mapping instead of there being a central registry, though. And because the commit author's name and email, the commit date, and the commit message (among other things) go into the hash that represents a commit, it's not that big a difference, is it? Given a collection of files and libgit, but not the git repo they came from, I can't say whether those files match a git tag hash unless I also have the metadata that makes up the commit. The files alone aren't enough.
Yes, but the commit object (which includes metadata) references a tree object by its hash. The tree object is a serialized representation of a directory tree, basically, referencing file blobs by hash. So yes, you can recognize identical files between commits. It's true there's no fast indexing: if you want to ask the question "which commits contain exactly this file?" you have to search every commit. But you don't need to diff the file contents themselves.
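As a rough sketch of the "tree references blobs by hash" idea, the following computes a tree ID for a single flat directory. It assumes the SHA-1 object format and that every entry is a regular, non-executable file (mode 100644); subdirectories, symlinks, and executables are deliberately left out, and the helper names are hypothetical.

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a git object: b'<kind> <size>\\0' + body."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).digest()


def flat_tree_hash(files: dict) -> str:
    """Tree ID for a single-level directory, given {file name: file bytes}.

    Each entry is b'100644 <name>\\0' followed by the raw blob ID, with
    entries sorted bytewise by name. No commit metadata is involved.
    """
    body = b""
    for name in sorted(files, key=lambda n: n.encode()):
        blob_id = git_object_hash(b"blob", files[name])
        body += b"100644 " + name.encode() + b"\x00" + blob_id
    return git_object_hash(b"tree", body).hex()


print(flat_tree_hash({"README": b"hello world\n"}))
```

Because author, date, and message never enter this computation, two commits with identical files end up pointing at the same tree ID.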
But people don't use the file hash; that's internal to git. I go to the centralized repository of repositories at github.com and look up tagged version 1.0.0 of whatever software, which refers to a git tag, which references a commit hash (which, yes, references a tree object, as you said).
"People" don't commonly use them, no. But it's a real and documented API to do this (see e.g. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects).

And in any case you had a specific requirement above ("Given a collection of files, but not the git repo they're from, and libgit, I can't say if those files match a git tag hash"), and in fact this can be done!

The git tag hash references a commit. Without the commit metadata, you don't have a tree object and thus don't know any hashes. You can take the files on disk and compute their hashes, and furthermore you can use those hashes to make a tree object. But without the commit, all you can say is that you have a tree object; you don't have the tree object for the commit in question to compare it to.
Read the link. You can extract the commit object and get the tree ref trivially. You can also enumerate commits (literally what "git log" is for). The only thing missing from the process is a fast reverse index going backwards from blob to tree to commit. But that can be generated in just a few seconds even for the largest repositories (16s to do a full git log of Linux on my box, for example).
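A toy sketch of both steps, reading the tree ref out of a commit object and building the reverse index from blob to commits, might look like this. The object IDs and the in-memory `commits`/`trees` dictionaries are invented for illustration; a real tool would read them from the object database and would also have to walk nested trees.

```python
from collections import defaultdict


def tree_of(commit_body: str) -> str:
    """Return the tree ID from the text of a commit object (its first header line)."""
    first_line = commit_body.split("\n", 1)[0]
    keyword, tree_id = first_line.split(" ", 1)
    if keyword != "tree":
        raise ValueError("not a commit object body")
    return tree_id


def reverse_index(commits: dict, trees: dict) -> dict:
    """Map blob ID -> set of commit IDs whose (flat) tree contains that blob."""
    index = defaultdict(set)
    for commit_id, body in commits.items():
        for blob_id in trees[tree_of(body)].values():
            index[blob_id].add(commit_id)
    return dict(index)


# Made-up IDs standing in for real 40-character hashes.
commits = {
    "c1": "tree t1\nauthor A <a@example.com> 0 +0000\n\nfirst\n",
    "c2": "tree t2\nparent c1\nauthor A <a@example.com> 1 +0000\n\nsecond\n",
}
trees = {
    "t1": {"README": "b1"},
    "t2": {"README": "b1", "main.c": "b2"},
}
index = reverse_index(commits, trees)
print(sorted(index["b1"]))
```

The index is a single pass over all commits, which is the same order of work as enumerating them with "git log".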

I'm at a loss. You keep saying something can't be done, but it can, and it's not even hard.

That's for human consumption though, which is what frustrates so many "hashing will solve everything!" schemes - it breaks as soon as you need a bug fix.

At the end of the day, none of us want "exactly this hash"; we want "latest". Exact hashes and other reproducibility features are useful when debugging or providing traceability. They're valuable, but they're not the human side of the equation.

There doesn't need to be a single central repository, there can be many partial ones. But if they are merged, they won't collide.

The GUID can certainly be a hash.

> The GUID can certainly be a hash.

It can’t be, because a GUID is supposed to be globally unique. The point is, it needs to instead be the hash of the content.

This can’t be an afterthought.

UUID versions 3 and 5 are derived from hashes (MD5 and SHA1 respectively).
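For the curious, here is a small sketch of what "derived from hashes" means for version 5: the SHA-1 of a namespace UUID plus a name, truncated to 16 bytes, with the version and variant bits overwritten. The helper name and the example name "example.org" are arbitrary; the result is checked against Python's standard `uuid.uuid5`.

```python
import hashlib
import uuid


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Build a version 5 UUID directly from a SHA-1 digest."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])      # keep the first 128 bits of the hash
    raw[6] = (raw[6] & 0x0F) | 0x50   # set the version field to 5
    raw[8] = (raw[8] & 0x3F) | 0x80   # set the RFC 4122 variant bits
    return uuid.UUID(bytes=bytes(raw))


name = "example.org"
by_hand = uuid5_by_hand(uuid.NAMESPACE_DNS, name)
print(by_hand == uuid.uuid5(uuid.NAMESPACE_DNS, name))
```

Note that the input here is a name within a namespace, not necessarily the content of the thing being identified, which is a separate question from whether a hash is involved.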
GUID and UUID are different.
The RFC defining them says they're the same and has since the earliest draft I can find, also from 2002. You should offer more explanation when you take a stance contrary to what is well documented.
A hash is not globally unique. I'm not sure what more explanation is needed.
How so? I thought they are the same, at least almost.

Tremulous (ioquake3 fork) had GUIDs from qkeys.

https://icculus.org/pipermail/quake3/2006-April/000951.html

You can see how qkeys are generated, and essentially a GUID is:

  Cvar_Get("cl_guid", Com_MD5File(QKEY_FILE, 0), CVAR_USERINFO | CVAR_ROM);

So, in this case, GUID is the MD5 hash of the generated qkey file. See "CL_GenerateQKey" for details.

> On startup, the client engine looks for a file called qkey. If it does not exist, 2KiB worth of random binary data is inserted into the qkey file. A MD5 digest is then made of the qkey file and it is inserted into the cl_guid cvar.
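The scheme in that quote can be sketched in a few lines. This is not the engine's code, just an illustration of the described behaviour; the function names are made up, and the exact digest formatting the engine uses is an assumption (lowercase hex here).

```python
import hashlib
import os

QKEY_SIZE = 2048  # 2 KiB of random data, per the quoted description


def load_or_create_qkey(path: str) -> bytes:
    """Return the qkey bytes, creating the file with random data if it is missing."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(os.urandom(QKEY_SIZE))
    with open(path, "rb") as f:
        return f.read()


def guid_from_qkey(qkey: bytes) -> str:
    """The GUID is simply the MD5 digest of the key file's contents."""
    return hashlib.md5(qkey).hexdigest()
```

Here the hash is taken over random data rather than meaningful content, so the GUID identifies whoever holds the key file, not anything about what the file says.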

UUIDs have RFCs, GUIDs apparently do not, but AFAIK UUIDs are also named GUIDs, so...

BitKeeper may be somewhat of a precedent (2000)?