by WalterBright 439 days ago
Around 2002 or so, I had an idea to tag every part of a project with a unique hash code. With a hash code, one could download the corresponding file. A hash code for the whole project would be a file containing a list of hash codes for the files that make up the project. Hash codes could represent the compiler that builds it, along with the library(s) it links with.

I showed it to a couple of software entrepreneurs (Wild Tangent and Chromium), but they had no interest in it.

I never did anything else with it, and so it goes.


I had actually done a writeup on it, and thought I had lost it. I found it, dated 2/15/2002:

---

Consider that any D app is completely specified by a list of .module files and the tools necessary to compile them. Assign a unique GUID to each unique .module file. Then, an app is specified by a list of .module GUIDs. Each app is also assigned a GUID.

On the client's machine is stored a pool of already downloaded .module files. When a new app is downloaded, what is actually downloaded is just a GUID. If the client sees that the GUID is an already built app in the pool, then he's done. If not, the client requests the manifest for the GUID, a manifest being a list of .module GUIDs. Each GUID in the manifest is checked against the client pool; any that are not found are downloaded and added to the pool.

Once the client has all the .module files for the GUIDs that make up an app, they can all be compiled, linked, and the result cached in the pool.
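
The download flow described in the writeup can be sketched roughly as follows. This is only an illustration, not the original design: the `Server` and `Client` classes, the use of a SHA-256 digest as the "GUID", and the concatenation that stands in for compile and link are all assumptions made for the example.

```python
import hashlib


def make_id(data: bytes) -> str:
    # Assumption: a content hash stands in for the writeup's GUID.
    return hashlib.sha256(data).hexdigest()


class Server:
    """Hypothetical master pool: module blobs and app manifests keyed by ID."""

    def __init__(self):
        self.modules = {}    # module ID -> module bytes
        self.manifests = {}  # app ID -> list of module IDs

    def check_in(self, module_blobs):
        ids = [make_id(blob) for blob in module_blobs]
        for module_id, blob in zip(ids, module_blobs):
            self.modules[module_id] = blob
        app_id = make_id("\n".join(ids).encode())
        self.manifests[app_id] = ids
        return app_id


class Client:
    def __init__(self, server):
        self.server = server
        self.pool = {}       # module ID -> module bytes already downloaded
        self.built = {}      # app ID -> cached build result
        self.downloads = 0   # number of module downloads performed

    def install(self, app_id):
        if app_id in self.built:
            return self.built[app_id]          # already built, nothing to do
        manifest = self.server.manifests[app_id]
        for module_id in manifest:
            if module_id not in self.pool:     # fetch only what is missing
                self.pool[module_id] = self.server.modules[module_id]
                self.downloads += 1
        # Stand-in for "compile and link": concatenate the modules.
        artifact = b"".join(self.pool[m] for m in manifest)
        self.built[app_id] = artifact
        return artifact


server = Server()
app_v1 = server.check_in([b"module a", b"module b"])
app_v2 = server.check_in([b"module a", b"module b, edited"])

client = Client(server)
client.install(app_v1)   # downloads both modules
client.install(app_v2)   # downloads only the edited module
```

After the second install, `client.downloads` is 3 rather than 4, because the unchanged module is served from the local pool.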

Thus, if an app is updated, only the changed .module files ever need to get downloaded. This can be taken a step further and a changed .module file can be represented as a diff from a previous .module.

Since .module files are tokenized source, two source files that differ only in comments and whitespace will have identical .module files.
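
As a rough illustration of that property, the sketch below hashes a token stream instead of raw text. It uses Python's own `tokenize` module as a stand-in for D's tokenizer, which is purely an assumption for the example; `module_digest` is a hypothetical helper name.

```python
import hashlib
import io
import tokenize

# Token types that carry only commentary or blank lines: dropped entirely.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENCODING}
# Structural tokens: keep the type, ignore the exact whitespace text.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def module_digest(source: str) -> str:
    """Hash the token stream of `source`, ignoring comments and spacing."""
    h = hashlib.sha256()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        text = "" if tok.type in _LAYOUT else tok.string
        h.update(f"{tok.type}:{text}\0".encode())
    return h.hexdigest()


commented = "x = 1  # set x\ny = x + 2\n"
compact = "x=1\n\ny  =  x+2\n"
different = "x = 2\ny = x + 2\n"
```

Here `module_digest(commented)` and `module_digest(compact)` agree, while `module_digest(different)` does not, since only the last one changes an actual token.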

There will be a master pool of .module files on WT's server. When an app is ready to release, it is "checked in" to the master pool by assigning GUIDs to its .module files. This master pool is what is consulted by the client when requesting .module files by GUID.

The D "VM" compiler, linker, engine, etc., can also be identified by GUIDs. This way, if an app is developed with a particular combination of tools, it can specify the GUIDs for them in the manifest. Hence the client will automatically download "VM" updates to get the exact tools needed to duplicate the app exactly.

yeah, allow me to introduce you to the Nix whitepaper, which is essentially this, and thus worth a read for you:

https://edolstra.github.io/pubs/nspfssd-lisa2004-final.pdf

Another possibly related idea is the language Unison:

https://www.unison-lang.org/

Thank you. Looks like my idea precedes Nix by 2 years!
NixOS may end up being "the last OS I ever use" (especially now that gaming is viable on it):

https://nixos.org/

Check it out. The whitepaper's a fairly digestible read, too, and may get you excited about the whole concept (which is VERY different from how things are normally done, but ends up giving you guarantees)

The problem with NixOS is all the effort to capture software closures is rendered moot by Linux namespaces, which are a more complete solution to the same problem.

Of course we didn't have them when the white paper was written, so that's fair but technology has moved on.

Nix(OS) is aware of namespaces, and can use them (in fact, the aforementioned gaming support relies on them), but versioning packages still works better than versioning the system in most cases.

Consider three packages, A, B, and C. B has two versions, A and C have one.

- A-1.0.0 depends on B-2.0.0 and C-1.0.0.
- C-1.0.0 depends on B-1.0.0.

If A gets a path to a file in B-2.0.0 and wants to share it with C (for example, C might provide binaries it can run on files, or C might be a daemon), it needs C to be in a mount namespace with B-2.0.0. However, without Nix-store-like directory structure, mounting B-2.0.0's files will overwrite B-1.0.0's, so C may fail to start or misbehave.
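
The overwrite problem can be shown with a toy model in which a filesystem is just a dictionary from path to content. The paths and the `install` helper are made up for illustration, and real Nix store paths also carry a hash prefix that is omitted here.

```python
def install(fs: dict, prefix: str, files: dict) -> None:
    """Copy a package's files into `fs` under `prefix` (toy filesystem)."""
    for rel_path, content in files.items():
        fs[f"{prefix}/{rel_path}"] = content


b_1 = {"lib/libB.so": "B-1.0.0"}
b_2 = {"lib/libB.so": "B-2.0.0"}

# Conventional layout: both versions want the same path, so one wins.
flat = {}
install(flat, "/usr", b_1)
install(flat, "/usr", b_2)      # silently replaces B-1.0.0's file

# Store-like layout: each version gets its own directory.
store = {}
install(store, "/store/B-1.0.0", b_1)
install(store, "/store/B-2.0.0", b_2)
```

In `flat`, C can no longer find B-1.0.0, while in `store` both versions remain addressable by path and A can hand C a path into B-2.0.0 without disturbing C's own dependency.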

I don't think that's true. How would you compile a program that has conflicting dependencies with a Linux namespace?
Linux namespaces and Nix closures solve different problems at different stages of the software lifecycle. Namespaces isolate running processes; Nix closures guarantee build-time determinism and reproducibility across systems.

Namespaces don’t track transitive dependencies, guarantee reproducible builds, enable rollback, or let you deploy an exact closure elsewhere. They’re sandboxing tools—not package management or infra-as-code.

If anything, the two are complementary. You can use Nix to build a system with a precise closure, and namespaces to sandbox it further. But calling namespaces a "more complete solution" is like calling syscall filtering a replacement for source control.

Also, minor historical nit: most namespaces existed by the late 2000s; Nix’s whitepaper was written after that. So the premise isn’t even chronologically correct.

Sounds like it's also halfway to a version of Nix designed specifically for D toolchains, too, using GUIDs instead of hashing inputs.
It wasn't designed specifically for D toolchains, that was just an example of what it could do.
Using a hash (content-addressable) instead of a GUID (random ID for each version) is a big difference, though
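
A small sketch of that difference, assuming a random version-4 UUID for the "GUID" case and SHA-256 for the content hash (both are choices made for the example, not something the thread specifies):

```python
import hashlib
import uuid


def random_guid() -> str:
    # A fresh random ID: says nothing about the content it labels.
    return str(uuid.uuid4())


def content_hash(data: bytes) -> str:
    # Derived from the bytes alone: anyone can recompute it.
    return hashlib.sha256(data).hexdigest()


module = b"tokenized module contents"

guid_on_machine_a = random_guid()
guid_on_machine_b = random_guid()
hash_on_machine_a = content_hash(module)
hash_on_machine_b = content_hash(module)
```

Two parties labelling the same bytes get different random GUIDs and need a registry to reconcile them, whereas the content hashes agree without any coordination.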
Interesting. I thought calling a program an "app" came with the smartphone era much later.
People called things like Lotus 1-2-3 “killer apps” in the 1980s.

A reference from 1989:

https://books.google.com/books?id=CbsaONN5y1IC&pg=PP75#v=one...

Your description (including the detailed description in the reply) seems to be missing the crucial difference that git uses - the hash code of the object is not some GUID, it is literally the hash of the content of the object. This makes a big difference as you don't need some central registry that maps the GUID to the object.
Every git repo has a copy of that mapping instead of there being a central registry though, and because the commit author's name and email, and the date of the commit and a commit message (among other things) go into the hash that represents a commit, it's not that big a difference, is it? Given a collection of files, but not the git repo they're from, and libgit, I can't say if those files match a git tag hash if I don't also have the metadata that makes up the commit to make the git hash, and not just the files inside of it.
Yes, but the commit object (which includes metadata) references a tree object by its hash. The tree object is a text representation of a directory tree, basically, referencing file blobs by hash. So yes, you can recognize identical files between commits. It's true there's no fast indexing: if you want to ask the question "which commits contain exactly this file?" you have to search every commit. But you don't need to delta the file contents itself.
but people don't use the file hash, that's internal to git. I go to the centralized repository of repositories at github.com and look up tagged version 1.0.0 of whatever software, which refers to a git tag which references a commit hash (which yes it references a tree object as you said).
"People" don't commonly use them, no. But it's a real and documented API to do this (see e.g. https://git-scm.com/book/en/v2/Git-Internals-Git-Objects).

And in any case you had a specific requirement above ("Given a collection of files, but not the git repo they're from, and libgit, I can't say if those files match a git tag hash"), and in fact this can be done!
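
For the file level, the blob ID can be recomputed from the bytes alone, with no repository at hand. The sketch below follows the object format described in the Git Internals chapter linked above for SHA-1 repositories (a `blob <size>` header, a NUL byte, then the content); `git_blob_id` is a name chosen for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello world\n")
```

The result should match what `git hash-object` reports for the same file. Matching a whole tag still requires the tree and commit objects, which is the point raised in the reply that follows.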

The git tag hash references a commit. Without the commit metadata, you don't have a tree object and thus don't know any hashes. You can take the files on disk and compute the hash, and furthermore you can take that hash and make a tree object. But without the commit, all you can say is you have a tree object; you don't have a tree object for the commit in question to compare it to.
That's for human consumption though, which is what frustrates so many "hashing will solve everything!" schemes - it breaks as soon as you need a bug fix.

At the end of the day none of us want "exactly this hash" we want "latest". Exact hashes and other reproducibility are things which are useful when debugging or providing traceability - valuable but also not the human side of the equation.

There doesn't need to be a single central repository, there can be many partial ones. But if they are merged, they won't collide.

The GUID can certainly be a hash.

> The GUID can certainly be a hash.

It can’t be, because a GUID is supposed to be globally unique. The point is, it needs to instead be the hash of the content.

This can’t be an afterthought.

UUID versions 3 and 5 are derived from hashes (MD5 and SHA1 respectively).
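
As a sketch of what a hash-derived UUID looks like in practice, Python's standard `uuid` module exposes `uuid5` (SHA-1 based). The namespace name `modules.example` and the choice to feed it a SHA-256 digest of the content are assumptions made for this example.

```python
import hashlib
import uuid

# Hypothetical namespace for this example; any fixed UUID would do.
MODULE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "modules.example")


def content_uuid(data: bytes) -> uuid.UUID:
    """Derive a version-5 UUID from the content's digest."""
    digest = hashlib.sha256(data).hexdigest()
    return uuid.uuid5(MODULE_NAMESPACE, digest)


first = content_uuid(b"module contents")
again = content_uuid(b"module contents")
other = content_uuid(b"different contents")
```

The same bytes always yield the same UUID, and `first.version` reports 5, so an identifier in UUID form can still be a function of the content.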
GUID and UUID are different.
The RFC defining them says they're the same and has since the earliest draft I can find, also from 2002. You should offer more explanation when you take a stance contrary to what is well documented.
How so? I thought they are the same, at least almost.

Tremulous (ioquake3 fork) had GUIDs from qkeys.

https://icculus.org/pipermail/quake3/2006-April/000951.html

You can see how qkeys are generated, and essentially a GUID is:

  Cvar_Get("cl_guid", Com_MD5File(QKEY_FILE, 0), CVAR_USERINFO | CVAR_ROM);
So, in this case, GUID is the MD5 hash of the generated qkey file. See "CL_GenerateQKey" for details.

> On startup, the client engine looks for a file called qkey. If it does not exist, 2KiB worth of random binary data is inserted into the qkey file. A MD5 digest is then made of the qkey file and it is inserted into the cl_guid cvar.

UUIDs have RFCs, GUIDs apparently do not, but AFAIK UUIDs are also named GUIDs, so...
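
The qkey scheme quoted above can be approximated in a few lines. This is a sketch of the described behavior, not the engine's code: `load_or_create_guid` is a hypothetical name, and the exact formatting of the digest in `cl_guid` is not reproduced here.

```python
import hashlib
import os
from pathlib import Path


def load_or_create_guid(qkey_path: Path) -> str:
    """Create a 2 KiB random qkey file if missing, then return its MD5 digest."""
    if not qkey_path.exists():
        qkey_path.write_bytes(os.urandom(2048))
    return hashlib.md5(qkey_path.read_bytes()).hexdigest()
```

Note that this GUID is a hash of random data, so it is stable per installation but tells you nothing about any content other than the key file itself.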

BitKeeper may be somewhat of a precedent (2000)?
Isn't this basically... a Merkle Tree, the underlying storage architecture of things like git and Nix?

https://en.wikipedia.org/wiki/Merkle_tree

Except that instead of a GUID, it's just a hash of the binary data itself, which ends up being more useful because it is a natural key and doesn't require storing a separate mapping
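
A minimal sketch of the idea, assuming SHA-256 and a nested dictionary as the project layout (neither git nor Nix encodes its objects exactly this way):

```python
import hashlib


def merkle_root(node) -> bytes:
    """Hash a tree where a node is bytes (a file) or a dict (a directory)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"leaf\0" + node).digest()
    entries = b"".join(
        name.encode() + b"\0" + merkle_root(node[name])
        for name in sorted(node)          # sorted so ordering is deterministic
    )
    return hashlib.sha256(b"tree\0" + entries).digest()


project = {"README": b"docs", "src": {"a.d": b"int a;", "b.d": b"int b;"}}
same = {"src": {"b.d": b"int b;", "a.d": b"int a;"}, "README": b"docs"}
edited = {"README": b"docs", "src": {"a.d": b"int a;", "b.d": b"long b;"}}
```

`project` and `same` produce the same root, while `edited` changes the root and the `src` subtree hash but leaves the hash of `README` untouched, which is what lets a client skip everything that did not change.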

Yep, invented in 1979 and also the core data structure of crypto block chains
I'd never heard of a Merkle Tree before, thanks for the reference.
While I get how that's like git, it sounds even closer to unison:

https://softwaremill.com/trying-out-unison-part-1-code-as-ha...

20 years later :-)
Hey Walter, what would you improve with Git?
Git hasn't quite taken the step of making the hash the URL you use to download a file, any file, and be assured it is exactly what you thought it was, as the hash of the file must match its URL.

This is currently done in a haphazard way, not particularly organized.
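
One way to picture "the hash is the URL" is a fetch that refuses any payload whose digest differs from the name it was requested under. In this sketch the `fetch` callable, the `HashMismatch` exception, and the choice of SHA-256 are all assumptions; the transport is left out entirely.

```python
import hashlib


class HashMismatch(Exception):
    """Raised when downloaded bytes do not match the requested hash."""


def fetch_verified(expected_hex: str, fetch) -> bytes:
    """Fetch data by its hash and verify it before returning it."""
    data = fetch(expected_hex)
    actual_hex = hashlib.sha256(data).hexdigest()
    if actual_hex != expected_hex:
        raise HashMismatch(f"wanted {expected_hex}, got {actual_hex}")
    return data


# A dictionary stands in for a remote server in this example.
payload = b"release tarball bytes"
address = hashlib.sha256(payload).hexdigest()
honest_server = {address: payload}
tampered_server = {address: b"something else"}
```

`fetch_verified(address, honest_server.__getitem__)` returns the payload, while the same call against `tampered_server` raises `HashMismatch`, so the source of the bytes does not need to be trusted.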

git over ipfs then?
I believe that's approximately what this is trying to do https://radicle.xyz/#:~:text=radicle%20is%20an%20open%20sour... although evidently using a custom protocol not ipfs itself
So you invented nix :-D
Similarly but I also had rsync or rdiff as a central character in my mental model of a VCS.