by jacques_chester 3542 days ago
The hard part -- sometimes really hard -- is the "bit-for-bit identical" requirement.

Lots of builds are recreatable but not reproducible (there is probably a better term of art here). You can go back to a point in time and build the version of the software as it was, but you are not guaranteed to get a bit-for-bit clone. (See https://reproducible-builds.org for a thorough discussion)

The problem is that there are lots of uncontrolled inputs to a build that are not due to source code or compiler changes. Most famously there are timestamps and random numbers, which mess up all sorts of hashing-based approaches.

These can even be non-obvious. Just the other day a colleague and I were investigating the (small but unsettling) possibility that an old buildpack had been replaced maliciously. We compared the historical hash to the file: different. We rebuilt the historical buildpack with trusted inputs: still different.

Then we unzipped both versions and diff'd the directories: identical.

What had thrown our hashes off was that zipfiles, by default, include timestamps. We have a build that is recreatable but not reproducible.
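A minimal Python sketch of that effect, using only the standard library. This is an illustration, not the tooling from the investigation above; the helper names (`make_zip`, `content_digests`) and the file name are made up for the example.

```python
import hashlib
import io
import zipfile


def make_zip(files, date_time):
    """Build an in-memory zip whose entries all carry the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def content_digests(zip_bytes):
    """Map each entry name to the SHA-256 of its uncompressed contents."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            name: hashlib.sha256(zf.read(name)).hexdigest()
            for name in sorted(zf.namelist())
        }


# Hypothetical file; the only thing that differs between the archives is the timestamp.
files = {"bin/compile": b"#!/bin/sh\necho building\n"}
old = make_zip(files, (2016, 1, 1, 0, 0, 0))
new = make_zip(files, (2016, 6, 1, 12, 30, 0))

# Hashing the archive bytes picks up the embedded timestamps...
archive_hashes_match = hashlib.sha256(old).digest() == hashlib.sha256(new).digest()
# ...while hashing the extracted contents does not.
contents_match = content_digests(old) == content_digests(new)
print(archive_hashes_match, contents_match)  # prints: False True
```

Comparing per-file content digests is roughly what unzipping and diffing does by hand: it ignores archive metadata and looks only at what the files contain.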

Speaking of builds, we are able to reproducibly build some binaries but not others. Off the top of my head our most high-profile non-reproducible build is NodeJS. Some other binaries (Ruby and Python, in my not-at-all-complete recollection) are fully reproducible.

This difficulty with fully reproducing makes it hard to provide a fully trustworthy chain of custody. A company which uses Cloud Foundry has in fact stood up an independent copy of our build pipelines inside their own secure network, so that they can be completely autarkic for the build steps leading to a complete buildpack. This doesn't defend against malicious source, but it defends against malicious builds.

Disclosure: I work for Pivotal, the majority donor of engineering to Cloud Foundry. As you've probably guessed, I'm currently a fulltime contributor on the buildpacks team.

6 comments

NixOS goes to great lengths to steer towards reproducibility. They run builds in chroots, they set all ctime to past 0, they make all directories except the output directory read only, etc. But even then compilers have all sorts of quirks, like running certain code paths multiple times and deciding which one runs faster on ~this~ CPU.

The biggest conceptual mistake we are making is that by default compilers always build for ~this~ machine, linking to this machine's libraries. This means the state of the machine inherently changes with every compilation (aka compiling is not a purely functional operation anymore). If I could go back in time and change automake and glibc, cross compiling and explicit dependency handling would be the norm. (As an aside, containers would greatly benefit too, as you wouldn't need to package an entire Linux distribution with every binary.)

I am sometimes amazed, sometimes disappointed by this reproducibility problem. Computers are supposed to be machines that can do the same thing again and again without a mistake, but this is not the case anymore. We have so many layers of complexity and everything is bolted together with duct tape. We focus on developer convenience in the short term, but in the long term we completely lose determinism. Sure, we can write more code faster than before, but building software is more problematic than ever.

Yet somehow everything seems to be going in this direction; in fact some people celebrate it and compare it to biology or evolution. I just call it "accidentally stochastic computing".

> some people celebrate it and compare it to biology or evolution.

Creating life is scientifically exhilarating, but incredibly dangerous.

This is a problem that the GNU Guix package manager[0] (and presumably its inspiration Nix[1]) are helping to solve. Any two git checkouts of Guix with the same git hash on the same architecture should produce bit-identical builds across time and space for many of the programs it packages. It's not true for everything they package yet, but they're making progress.

(Some might find the documentation for the `guix challenge` command interesting: https://www.gnu.org/software/guix/manual/html_node/Invoking-...)

[0] https://www.gnu.org/software/guix/ [1] https://nixos.org/nix/

Interestingly, this problem appears in some pretty diverse circumstances.

The one that springs to mind is video game emulation, from more than a decade ago - the communities there needed a means to reproducibly compress their ROMs, both for verifying dumps and for sharing large sets of ROMs. The tools created have been through many iterations, but they're still in use today in the form of TorrentZip[0] and various relatives like torrent7z.

[0]: https://github.com/uwedeportivo/torrentzip

Well I know what tool I'll be looking at on Monday :)

There are some efforts [1] to make reproducible builds really work, and the Nix guys have some experience with them too, as others have noted. Isolated deterministic environments and stripping binaries/archives (the strip-nondeterminism tool) [2] generally do the trick.

[1] https://reproducible-builds.org

[2] https://reproducible-builds.org/tools/

Some of my predecessors on buildpacks went through a bunch of work to establish reproducibility for binaries we ship, with varied levels of success:

"Investigate how we can allow users to independently verify/authenticate a final buildpack" (https://www.pivotaltracker.com/story/show/104469634)

"Explore: Compiled binaries should be reproducible" (https://www.pivotaltracker.com/story/show/104746074)

"determine whether the libfaketime reproducible build strategy will work across all of our binaries" (https://www.pivotaltracker.com/story/show/107752798)

"Investigate Why are our node builds not reproducible?" (https://www.pivotaltracker.com/story/show/128161137)

There was also supporting work to help independent verification of the "chain of custody". There are 25 stories under that label, if you use the search box.

Bit-for-bit reproducibility does require changing how you do things. Always use checksums, not dates. For zip we have a special version that sets all the timestamps in the archive to a fixed value. Anything writing records to a file based on a hash table needs to sort the entries. The team needs to be dedicated to making all its tools work this way.
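A rough Python sketch of the fixed-timestamp and sorted-entries ideas, using only the standard library. It is not the special zip version mentioned above; `deterministic_zip`, `FIXED_DATE_TIME` and the pinned permission value are assumptions chosen for illustration.

```python
import hashlib
import io
import zipfile

# Assumption: any fixed value works; 1980-01-01 is the earliest date zip can store.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def deterministic_zip(files):
    """Return zip bytes that depend only on entry names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Sort names so dict insertion order cannot leak into the archive.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE_TIME)
            # Store uncompressed: compressed bytes can vary between zlib builds.
            info.compress_type = zipfile.ZIP_STORED
            # Pin metadata that would otherwise depend on the build host.
            info.create_system = 3            # always claim "Unix"
            info.external_attr = 0o644 << 16  # fixed permissions (assumed value)
            zf.writestr(info, files[name])
    return buf.getvalue()


# Same files, supplied in a different order.
a = deterministic_zip({"b.txt": b"two", "a.txt": b"one"})
b = deterministic_zip({"a.txt": b"one", "b.txt": b"two"})
same = hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest()
print(same)  # prints: True
```

The same rule applies to anything else iterated out of a hash table: pick an explicit order before writing, so the output depends on the data rather than on iteration order.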

The good part is that you have a clear goal.

Next time, use diffoscope to track down the exact differences.

Thanks for the tip!