by neatcoder 2663 days ago
What software currently has 100% reproducible builds?
7 comments

What software? Like, individual packages? Many of them - here's the ones that do so on Arch: https://tests.reproducible-builds.org/archlinux/archlinux.ht...

If you mean which distribution has 100% of its packages reproducible, probably none yet. But Arch and Debian are both making progress.

To be fair:

> 56 (100.0%) out of 56 built NetBSD files were reproducible

That is not really comparable to Debian's 26475/28522 for buster https://tests.reproducible-builds.org/debian/reproducible.ht...

But is it "reproducible" or reproducible? Holger still considers the Debian numbers "reproducible" as we are only building things twice. To achieve proper reproducible builds, the artifacts need to be reproducible by users. User-facing tools need to be provided, and I have yet to see how NetBSD provides this.
As far as I understand your pages that applies only to the kernel, not the whole distribution. The Linux kernel is reproducible as well: https://tests.reproducible-builds.org/archlinux/archlinux.ht...
Package management-wise, Nix.

OS-wise, NixOS.

Build system-wise, there are lots of options: Blaze, Buck, Pants, Please (AFAIK)

Unfortunately, Nix does not produce fully reproducible builds. The build environment is portable and produced in a way that it can be repeated, but due to the limitations of the software that is being built, the builds are not binary reproducible. You can see some commentary on the Nix team hoping to adopt some of the work being done by Debian et al. here: https://github.com/NixOS/nixpkgs/issues/9731
There is also https://r13y.com/
Nice!

In case any Nixers are reading this, here is how I got NSPR to build reproducibly in Guix:

https://git.savannah.gnu.org/cgit/guix.git/commit/?id=6d7786...

Interesting, thanks for sharing!
NixOS currently isn't there. A lot of this work is done, but much still remains. We have benefited greatly from Debian's work, though (Debian maintainers frequently come across as happy upstream participants to fix issues like this in the ecosystem, which really helps everyone!)

https://r13y.com/ tracks the progress of NixOS reproducibility; currently we're at 98.23% bit-for-bit identical for our minimal installer ISO. After that, we'll need the graphical installer, and then more of the base package set. So we've still got a ways to go.

Any program with a build system designed in such a way that it doesn't introduce anything beyond the source code into the binary should be.

If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

> If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

There is so much context that is normally embedded into a binary that this is usually not true unless explicit measures have been taken.

Two very common sources that introduce variability are time-stamps used in the build, and environment variables such as $HOME and $USER.
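A minimal sketch of one common countermeasure for the timestamp case, assuming the build honours the `SOURCE_DATE_EPOCH` environment variable convention from the Reproducible Builds project (not mentioned in this thread). The names `build_timestamp`, `banner` and `myprog` are made up for illustration:

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp() -> str:
    """Return the timestamp string to embed in a build artifact.

    If SOURCE_DATE_EPOCH is set, it replaces the wall clock, so two builds
    of the same source embed the same value.
    """
    epoch = os.environ.get("SOURCE_DATE_EPOCH")
    seconds = int(epoch) if epoch is not None else int(time.time())
    # Format in UTC so the builder's local timezone does not leak in.
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def banner(version: str) -> str:
    # Hypothetical version banner; deliberately leaves out $USER and $HOME.
    return f"myprog {version} (built {build_timestamp()})"
```

Without the variable set, the function falls back to the current time, which is exactly the kind of variability described above.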

If you're generating or modifying source code at build time (e.g. adding timestamps or build IDs) then you have violated the constraints on build reproducibility.
If you define the problem as excluding things a large percentage of real-world build systems do by default, then it's not very interesting. The interesting part of the work by Debian and others here is making this work with small, unintrusive changes to such systems.
As long as the intrusive changes are taken upstream, I've no problem with it.
what if you have a multi-threaded backend to the compiler that happens to lay down data in different orders?
You don't even need multi-threading. In gcc we had at least one case where a key=>value data structure was keyed by memory address, causing symbols to be emitted in different order depending on ASLR, phase of the moon, or whatever.
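To illustrate the effect in a toy setting (this is not gcc's code, just a sketch of the same class of bug): in CPython, `id()` is an object's memory address, so ordering output by it can differ from run to run, while ordering by a property of the input stays stable. `Symbol`, `emit_by_address` and `emit_by_name` are hypothetical names:

```python
class Symbol:
    def __init__(self, name: str) -> None:
        self.name = name


def emit_by_address(symbols):
    # id() is the memory address in CPython, so this order depends on
    # where the allocator happened to place each object.
    return [s.name for s in sorted(symbols, key=id)]


def emit_by_name(symbols):
    # Ordering by the symbol name depends only on the input.
    return [s.name for s in sorted(symbols, key=lambda s: s.name)]
```

Both functions emit the same set of symbols; only the second one promises the same sequence on every run.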
Why?

Most compilers give no guarantees about the order in which they lay out data. I love deterministic processes as much as everyone. But randomized approaches have their advantages too. And if a compiler has reasons to randomize output, e.g. for speed, then it's a trade-off to consider.

Thread finishes work

grabs lock

writes to file

writes to index

releases lock

That's not a race condition. The output order doesn't matter, but it is nondeterministic.
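The steps above can be sketched with Python threads. This is only an illustration under the assumption that each worker appends one entry to a shared index; `run_workers` and `run_workers_normalized` are made-up names. The lock prevents a data race, but the arrival order still depends on scheduling, and sorting afterwards is one way to normalize it without serializing the work:

```python
import threading


def run_workers(names):
    index = []
    lock = threading.Lock()

    def work(name):
        with lock:              # grabs lock, so no data race
            index.append(name)  # "writes to index"; order depends on scheduling

    threads = [threading.Thread(target=work, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return index


def run_workers_normalized(names):
    # Same parallel work, but the recorded order is fixed after the fact.
    return sorted(run_workers(names))
```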

Why is it a bug? I write a program to download four files. I do so in parallel. Sometimes X finishes first, sometimes Y finishes first, and the files are written to disk in a different order. Why do I want to serialize this operation?
Don't forget the absolute paths of the source files...
It always bugged me that that's considered part of reproducibility.

That's 100% controllable and deterministic.

Until very recently you needed root access to do it on Linux (user namespaces can let you do it without root).
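An alternative to building at a fixed location is to rewrite the recorded paths, which is roughly what GCC's `-fdebug-prefix-map=old=new` option does for debug info. A toy sketch of the idea, where `remap_source_path` and the `/build` placeholder are assumptions for illustration:

```python
from pathlib import PurePosixPath


def remap_source_path(path: str, build_root: str, placeholder: str = "/build") -> str:
    """Rewrite an absolute source path so the checkout location does not leak."""
    try:
        rel = PurePosixPath(path).relative_to(build_root)
    except ValueError:
        return path  # not under the build root: leave it untouched
    return str(PurePosixPath(placeholder) / rel)
```

With this, `/home/alice/src/proj/main.c` and `/home/bob/work/proj/main.c` both end up recorded as `/build/main.c` when each is remapped relative to its own checkout.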
Until a build process starts naming things with timestamps, locales, etc. Just because the build is "source code only" doesn't mean it is deterministic.
That's why I wrote "it doesn't introduce anything beyond the source code into the binary". Unfortunately, I forgot to emphasize the anything.

A build process that names things with timestamps or leaks your locale into the build configuration (or doesn't pin build-time dependency versions) will make the build depend on things other than the source code (both program and build settings) you made available.

It may even be desirable for it to be non-reproducible - if, for instance, you want to use optimizations targeted to your specific system, then your build system will have to introduce the architecture information into the build process and your build will result in a unique binary that targets your own machine.

Unfortunately, if we take this definition of "anything" literally, it is impossible to build such a build system.

For example, depending on the input order, the linker may produce different output. Sure, you can sort the object files, but the sorted order of the object files is still effectively "stored" in the binary, and that's not source code.

You can only normalize such things (as in the example above, by sorting); you cannot eliminate them. They naturally exist.

> you can sort the object files, but the sorted object files order is still effectively "stored" into the binary, and that's not source code.

No, but the order should be explicitly defined in the build scripts or the result will not be deterministic.
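A small sketch of the point about input order, using a stand-in "linker" that just concatenates its inputs. `link`, `link_sorted` and `digest` are hypothetical helpers, and a real linker does far more than this:

```python
import hashlib


def link(objects):
    """Stand-in for a linker: concatenate (name, data) pairs in the order given."""
    return b"".join(data for _name, data in objects)


def link_sorted(objects):
    # Fix the input order explicitly (here: by file name) before linking.
    return link(sorted(objects, key=lambda item: item[0]))


def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()
```

If the inputs come from something like a directory listing, `link` can see them in a different order on different machines; `link_sorted` gives the same digest regardless.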

If the order triggers, say, a linker bug that makes one in 50 builds crash, execution will not be deterministic and that's really, really bad.

This is actually an annoying challenge of reproducible builds. In many cases it is actually useful to have a build timestamp, git sha, or build number available for debug output from the program. I've often gone as far as embedding a sha and/or timestamp into a file on export into a tgz which allows it to be reproducible from the tarfile, although builds directly out of source control would not be.
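A minimal sketch of that export step with Python's standard `tarfile` and `gzip` modules, assuming the commit hash and commit time are passed in by the caller. `make_tgz` and the `VERSION` file name are made up for illustration. Entry order, mtimes, ownership and the gzip header timestamp are all pinned, so equal inputs give equal bytes (with the same Python and zlib versions):

```python
import gzip
import io
import tarfile


def make_tgz(files: dict, commit: str, epoch: int) -> bytes:
    """Pack files (name -> bytes) plus a VERSION file into a .tgz."""
    entries = dict(files)
    entries["VERSION"] = f"{commit} {epoch}\n".encode()

    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(entries):      # fixed entry order
            data = entries[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = epoch            # commit time, not the wall clock
            info.mode = 0o644
            info.uid = info.gid = 0       # no builder identity
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))

    out = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header constant.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", mtime=0) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

The program can then read `VERSION` at build time, so the build from the tarball stays reproducible even though it reports a commit and a date.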
Git hashes can be inserted in reproducible builds, they are deterministic.
Compilers haven't been built with that as a condition, so this isn't true.

It's not true in practical code either, people like to stick in timestamps.

It's not ever true on Windows, unless you use the fairly recent PE header changes.

The vast majority of code written at Google does, for one.
Bit for bit reproducible?

So I can checkout an arbitrary version from years ago and reproduce the exact same set of output files?

Yes, look up Google Bazel (the open source version of Blaze)
Right, that's not a bit-for-bit reproducible build.

Think of it as absent a cache I should get bit for bit identical out (perhaps ignoring logs and such).
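One way to check that property, sketched here under the assumption that each build writes its outputs to its own directory: hash every output file in both trees and report the paths that differ, skipping logs. `tree_digests` and `differing_files` are hypothetical helpers, not part of any build tool mentioned here:

```python
import hashlib
from pathlib import Path


def tree_digests(root, ignore_suffixes=(".log",)):
    """Map each relative file path under root to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix not in ignore_suffixes:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(build_a, build_b):
    """Return relative paths that are missing from one tree or differ in content."""
    a, b = tree_digests(build_a), tree_digests(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from `differing_files` means the two builds are bit-for-bit identical apart from the ignored files.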

I think GNU Guix offers what you are after.

I maintain my own build farm and tried comparing my results against the official CI server:

  $ guix challenge --substitute-urls="https://ci.guix.info"
  14,224 store items were analyzed:
    - 4,972 (35.0%) were identical
    - 265 (1.9%) differed
    - 8,987 (63.2%) were inconclusive

Of the 5,237 build artifacts that were available on the substitute server, only 265 (5%) differed.
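For anyone checking the arithmetic, the figures follow from the `guix challenge` output above (the helper `pct` is just for illustration):

```python
identical, differed, inconclusive = 4972, 265, 8987

total = identical + differed + inconclusive  # all analyzed store items
comparable = identical + differed            # items the substitute server had


def pct(part: int, whole: int) -> float:
    return round(100 * part / whole, 1)


share_of_total = pct(differed, total)            # share of everything analyzed
share_of_comparable = pct(differed, comparable)  # share of what could be compared
```

The 1.9% is relative to everything analyzed, while the roughly 5% is relative to the items that could actually be compared.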

All of these items can be (and have been) built entirely from source, starting with Guix' initial "binary seeds", on (probably) different hardware and kernel compared to the CI system.

I don’t think “one artifact, one vote” is a fair way to measure this.

One reason builds become irreproducible is when a build is multi-threaded, and the order in which artifacts are combined into larger ones becomes unpredictable. That problem doesn’t exist, or at least is a lot smaller, for ‘leaf’ artifacts (example: if your C compiler is single-threaded, and you run make multi-threaded, individual object files do not have the ordering problem, but libraries built from multiple object files do)

On the other hand, a single static struct with a padding “hole” that isn’t consistently written that happens to end up in lots of binaries will decrease your percentage a lot.

What does guix mean by "inconclusive"?
GuixSD, NixOS
I believe Solaris is as well.