| The hard part -- sometimes really hard -- is the "bit-for-bit identical" requirement. Lots of builds are recreatable but not reproducible (there is probably a better term of art here). You can go back to a point in time and build the version of the software as it was, but you are not guaranteed to get a bit-for-bit clone. (See https://reproducible-builds.org for a thorough discussion.) The problem is that there are lots of uncontrolled inputs to a build that are not due to source code or compiler changes. Most famously there are timestamps and random numbers, which mess up all sorts of hashing-based approaches. These can even be non-obvious. Just the other day a colleague and I were investigating the (small but unsettling) possibility that an old buildpack had been replaced maliciously. We compared the historical hash to the file: different. We rebuilt the historical buildpack with trusted inputs: still different. Then we unzipped both versions and diff'd the directories: identical. What had thrown our hashes off was that zipfiles, by default, include timestamps. We have a build that is recreatable but not reproducible. Speaking of builds, we are able to reproducibly build some binaries but not others. Off the top of my head, our most high-profile non-reproducible build is NodeJS. Some other binaries (Ruby and Python, in my not-at-all-complete recollection) are fully reproducible. This difficulty with full reproduction makes it hard to provide a fully trustworthy chain of custody. One company that uses Cloud Foundry has in fact stood up an independent copy of our build pipelines inside its own secure network, so that it can be completely autarkic for the build steps leading to a complete buildpack. This doesn't defend against malicious source, but it defends against malicious builds. Disclosure: I work for Pivotal, the majority donor of engineering to Cloud Foundry. As you've probably guessed, I'm currently a full-time contributor on the buildpacks team. |
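To make the zipfile point concrete, here is a rough Python sketch of the effect. It is not the buildpack tooling described above: the file names, contents, and dates are invented for illustration, and `build_zip`, `archive_hash`, and `content_hashes` are hypothetical helpers. It only shows that two archives with identical contents hash differently when their entry timestamps differ, and that pinning the timestamp makes the archive hash repeatable on the same machine and Python version.

```python
import hashlib
import io
import zipfile

# Invented stand-in for a buildpack's contents (an assumption, not real files).
FILES = {
    "bin/run": b"#!/bin/sh\necho hello\n",
    "VERSION": b"1.0.0\n",
}


def build_zip(files, date_time):
    """Build an in-memory zip, stamping every entry with the given date_time."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):  # fixed entry order, another hidden input
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, files[name])
    return buf.getvalue()


def archive_hash(data):
    """Hash the archive bytes, timestamps and all."""
    return hashlib.sha256(data).hexdigest()


def content_hashes(data):
    """Hash each member's decompressed contents, ignoring zip metadata."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {
            name: hashlib.sha256(zf.read(name)).hexdigest()
            for name in sorted(zf.namelist())
        }


old = build_zip(FILES, (2016, 1, 1, 0, 0, 0))     # "historical" build
new = build_zip(FILES, (2017, 6, 1, 12, 0, 0))    # rebuild at a later time
pinned = build_zip(FILES, (2016, 1, 1, 0, 0, 0))  # rebuild with pinned time

print(archive_hash(old) == archive_hash(new))        # False
print(content_hashes(old) == content_hashes(new))    # True
print(archive_hash(old) == archive_hash(pinned))     # True
```

Comparing per-file content hashes is roughly what unzipping both versions and diffing the directories does by hand. Pinning the timestamp is one common way to get stable archive hashes; whether that is practical depends on the tool that actually produces the zip.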
The biggest conceptual mistake we are making is that by default compilers always build for ~this~ machine, linking against this machine's libraries. This means the state of the machine inherently changes with every compilation (i.e., compiling is no longer a purely functional operation). If I could go back in time and change automake and glibc, cross-compiling and explicit dependency handling would be the norm. (As an aside, containers would greatly benefit too, as you wouldn't need to package an entire Linux distribution with every binary.)
I am sometimes amazed, sometimes disappointed by this reproducibility problem. Computers are supposed to be machines that can do the same thing again and again without a mistake, but this is not the case anymore. We have so many layers of complexity, and everything is bolted together with duct tape. We focus on developer convenience in the short term, but in the long term we completely lose determinism. Sure, we can write more code faster than before, but building software is more problematic than ever.
Yet somehow everything seems to be going in this direction; in fact, some people celebrate it and compare it to biology or evolution. I just call it "accidentally stochastic computing".