by Thorrez 907 days ago
>In general, build cache objects are usually addressed by a content-addressable-hash

How does that work? I would think the simplest case of a build object that needs to be cached is a .o file created from a .c file. The compiler sees the .c file and can determine its hash, but how can the compiler determine the hash of the .o file to know what to look up in the cache? I think the compiler would need to perform the lookup using the hash of the .c file, which isn't a hash of the data in the cache.

In the case of the Remote Execution/Cache API used by Bazel among others[1] at least, it's a bit more detailed. There's an "ActionCache" and an actual content-addressed cache that just stores blobs ("ContentAddressableStorage", or CAS). When you run a `gcc -O2 foo.c -o foo.o` command (locally or remotely; doesn't matter), you upload an "Action" into the ActionCache, which basically says "This command was run. As a result it had this stderr, stdout, and exit code, and these input files were read and these output files written." The input and output files are referenced by the hash of their contents, and the files themselves get uploaded into the CAS.

Most importantly you can look up an action in the ActionCache without actually running it, provided you have the inputs at hand. So now when another person comes by and runs the same build command, they say "Has this Action, with these inputs, been run before?" and the server can say "Yes, and the output is a file identified by hash XYZ" where XYZ is the hash of foo.o, so you can just instantly download it from the CAS.
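To make the two-level lookup concrete, here is a minimal in-memory sketch in Python. It is not the Remote Execution API itself: `CAS`, `ActionCache`, `action_key`, `build`, and the `run` callback are hypothetical names made up for illustration, and the real API uses protobuf messages and gRPC rather than JSON and dicts.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content hash used as the address of a blob."""
    return hashlib.sha256(data).hexdigest()


class CAS:
    """Hypothetical content-addressed store: key is always hash(value)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = digest(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class ActionCache:
    """Hypothetical map from action key to a recorded result."""

    def __init__(self):
        self._results = {}

    def get(self, key: str):
        return self._results.get(key)

    def put(self, key: str, result: dict) -> None:
        self._results[key] = result


def action_key(command, input_digests) -> str:
    """Hash of the command line plus the digests of every input file."""
    payload = json.dumps(
        {"command": command, "inputs": sorted(input_digests.items())},
        sort_keys=True,
    )
    return digest(payload.encode())


def build(command, inputs, cas, action_cache, run):
    """Return (output_bytes, cache_hit). `run` is only called on a miss."""
    input_digests = {path: cas.put(data) for path, data in inputs.items()}
    key = action_key(command, input_digests)

    cached = action_cache.get(key)
    if cached is not None:
        # The ActionCache entry names the output by its content hash,
        # so the bytes themselves come from the CAS.
        return cas.get(cached["output_digest"]), True

    output = run(command, inputs)
    action_cache.put(key, {"output_digest": cas.put(output)})
    return output, False
```

So the ActionCache is keyed by a hash of the inputs and the command, and its value points at output hashes; only the blob store is content-addressed in the strict sense.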

So there are a few more moving parts to make it all work. But the system really is ultimately content-addressed, for the most part.

[1] https://github.com/bazelbuild/remote-apis/blob/main/build/ba...

If you're only using remote caching (i.e., no remote execution), then all cache clients need to trust each other, because a malicious client can upload any result it wants to a given ActionCache key, and there's no way to verify that the ActionCache entries are correct unless the actions are reproducible. (And verifying ActionCache entries by rerunning the actions kind of defeats the purpose of using a build cache.)

If you don't want clients to have to trust each other, then you can deny clients write access to the ActionCache and add remote execution. In this setup clients upload an action to the CAS, remote executors run the action and then upload the result to the ActionCache, using the hash of the action as the key. This way malicious clients can't spoof cache results for other clients, because other clients won't ever look for the malicious action's key in the ActionCache.
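A rough sketch of that setup in Python, under stated assumptions: `GuardedActionCache`, `Executor`, `upload_action`, and the principal names are all hypothetical, the CAS is just a dict, and a real deployment would enforce the write restriction with authentication rather than a string comparison.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def upload_action(cas: dict, command, input_digests) -> str:
    """Client side: store the action description in the CAS, return its hash."""
    blob = json.dumps(
        {"command": command, "inputs": sorted(input_digests.items())},
        sort_keys=True,
    ).encode()
    key = digest(blob)
    cas[key] = blob
    return key


class GuardedActionCache:
    """Hypothetical ActionCache that anyone can read but only executors can write."""

    def __init__(self, writers):
        self._writers = set(writers)
        self._results = {}

    def get(self, action_digest: str):
        return self._results.get(action_digest)

    def put(self, principal: str, action_digest: str, output_digest: str) -> None:
        if principal not in self._writers:
            raise PermissionError(f"{principal} may not write to the ActionCache")
        self._results[action_digest] = output_digest


class Executor:
    """Hypothetical trusted worker that runs actions fetched from the CAS."""

    def __init__(self, name, cas, action_cache, run):
        self.name, self.cas, self.action_cache, self.run = name, cas, action_cache, run

    def execute(self, action_digest: str) -> str:
        action = json.loads(self.cas[action_digest])
        output = self.run(action)
        output_digest = digest(output)
        self.cas[output_digest] = output
        # The key is the hash of the action the executor actually ran,
        # so a client cannot choose which entry the result lands under.
        self.action_cache.put(self.name, action_digest, output_digest)
        return output_digest
```

A malicious client can still submit a malicious action, but its result is stored under that action's own hash, which honest clients never compute.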

In Bazel’s case and other cases, build cache objects are held in CAS and then referenced from other keys. I believe BuildXL from Microsoft also works this way.

Of course one other advantage to build caches is they are verifiable: the intent is to produce the exact same output as a normal call, and that’s easily checked on the client side.
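As a minimal sketch of what that client-side check could look like (the function name and the `rebuild` callback are hypothetical, and the check only means something if the action is reproducible):

```python
import hashlib


def verify_cached_output(cached_digest: str, rebuild) -> bool:
    """Rebuild locally and compare the result's hash with the cached digest.

    `rebuild` is a hypothetical zero-argument callable returning the output
    bytes of running the action locally.
    """
    local_output = rebuild()
    return hashlib.sha256(local_output).hexdigest() == cached_digest
```

Doing this for every entry would cost as much as not caching at all, so in practice it would presumably be a spot check on a sample of entries.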

No question that build caching poses inherent supply chain risks though and that’s part of what we want to solve. I think people are hesitant to trust build caching for good reason until there are safer mechanisms and better cryptographic patterns applied.

Yep, aseipp, and we support the full gRPC interface for remote caching offered by Bazel, including the newer APIs.

Explained better than I could for sure. I find it very interesting how BuildXL and Bazel ended up at similar models for this problem. I don’t yet know the history of which informed which.

(As compared to, say, Gradle, which works based on input hashes instead.)

When a .o is stored in the cache, it is associated with the hash of the .c file.