Hacker News new | ask | show | jobs
by ahmedfromtunis 970 days ago
Stupid question as I never worked on something like this before: why isn't reproducibility the default behavior?

I mean if 2 copies of a piece of software were compiled from the same source, what stops them from being identical each and every time?

I know there are so many moving parts, but I still can't understand how discrepancies can manifest themselves.

8 comments

There are many specific causes, time stamps probably being the most common issue. You can see a list of common issues here:

https://reproducible-builds.org/docs/

The main overall issue is that developers don't test to ensure they reproduce. Once it's part of the release tests it tends to stay reproducible.

I agree, although I wouldn't describe the overall issue as developers not testing to ensure reproducibility. The reason most builds aren't reproducible is that build reproducibility isn't a goal for most projects.

It would be great if 100% of builds were reproducible, but I don't believe developers shouldn't be testing for reproducibility unless it's a defined goal.

As generalized reproducible build tooling (guix, nix, etc.) becomes more mainstream, I imagine we'll see more reproducible builds as adoption grows and reproducibility is no longer something developers have to "check for", but simply rely upon from their tooling.

It's also because the cost of making things reproducible is still too high.

We have the tooling, but it still takes a bit of effort from the developer's side to integrate those into their CI pipeline.

Eventually we will get to a place where this will be the default. It will be integrated into day-to-day tooling like `cargo release`, `npm publish`, ...

Typo: I don't believe developers shouldn't be -> I don't believe developers should be
Loads of things. Obvious ones where the decision is explicitly taken to be non-reproducible include timestamps and authorship information. There are also other places where reproducibility is implicitly broken by default: e.g. many runtimes don't define the order of entries in a hashmap, and then the compiler iterates over a hashmap to build the binary.
I can see why devs would want "This Software was built on 10/10/2007 by bob7 from git hash aaffaaff" to appear on the splash screen of software.

How do you get similar behaviour while having a reproducible build?

Can you, for example, have the final binary contain a reproducible part, and another section of the elf file for deliberately non-reproducible info?

if you have a reproducible build, then the notion of "software was built on date by user" is kind of useless information, no? Because it does not matter - if you can verify that a specific git hash of a codebase results in a particular binary through reproducible builds, a malicious adversary could have built it yesterday and given it to me and i can be almost surely confident (barring hash-collisions...) it's identical to a known trusted team member building it.

Having information about which git has was used, as well as the time it was published, is part of the source distribution so an output can contain references to these inputs and still be deterministic w.r.t. those inputs.

If you REALLY want to know when/who built something, you could add in an auxiliary source file which contains that information, which is required to build. Which is essentially what compilers which leverage current time do anyway, it's just implicit.

The usecase is: user wants an easy way to know, from the GUI of some running software, exactly what build/version/git commit/branch/date they're running - perhaps to file a bug report for example.

The actual build date doesn't matter if the software is reproducible - but its a proxy for 'how out of date is this software'.

If you actually had reproducible builds, the build date would not tell you anything about how out of date the software is—you would only need the date of the source code the binary was built from. By definition, the binary you'd get from building a version of the source today would be identical to the version you'd get building it the day that version of the source was finished.
In that case, you can report the Git SHA and still be reproducible.
You can still include the git commit id and its date on the build.
Your source would also have to have a reference to which exact version of which compiler to use, which versions of which external headers to use, etc. and now you're inventing Nix.

Conceivably there could be a standard for a sidecar file to specify how something was built (e.g. nixpkgs commit hash, or all of the parameters that went into the build). Or content address the inputs, i.e. invent Nix again.

So we could solve this problem by having everyone standardize on using Nix.

Such standards do exist: https://slsa.dev/spec/v1.0/provenance
Yeah, "who built this" information belongs in a signing certificate that accompanies the build artefact, not in the artefact itself. The Git hash can certainly appear in the binary (it's a reproducible part of the build input), and the date can instead be e.g. the commit date, which is probably more relevant to a user anyway.
Much as I like Git, I'm not sure I like the idea of the artefacts depending on the git commit and therefore on the entire git history. I rather feel the artefacts should only depend on the actual source and not on a particular version control system used for storing the source.
You're welcome to include full sources, or not-tied-to-git directions to acquire them, with your release binaries.

Regardless, whether or not you do that is a discussion of distribution format, not binary reproducibility. Your distribution can contain as much (or as little) additional material as you like along with your release binaries.

Absolutely. I'm just stating my personal preference, that's all.
You can still include the git hash or a git tag/release version info, since the reproducer has the same git repo anyway.

But including timestamp of build would necessitate “spoofing” the timestamp by the reproducer to be the same as the original.

Parallelism. There might be actions that are not order-independent, and the state of the CPU might result in slightly different binaries, but all are correct.
Why does this matter though? Why does order of compilation result in a different binary?
Just some random, made up example: say you want to compile an OOP PL that has interfaces and implementations of that. You discover reachable implementations through static analysis, which is multi-threaded. You might discover implementations A,B,C in any order — but they will get their methods placed in the jump table based on this order. This will trivially result in semantically equivalent, but not binary-equivalent executables.

Of course there would have been better designs for this toy example, but binary reproducibility is/was usually not of the highest priority historically in most compiler infrastructures, and in some cases it might be a relatively big performance regression to fix, or simply just a too big refactor.

Because order of completion of the parallel tasks is not guaranteed, if all tasks write to the same file you might get a different result each time.
> There might be actions that are not order-independent, and the state of the CPU might result in slightly different binaries, but all are correct.

Well no: that's really the thing reproducible packages are showing: there's only one correct binary.

And it's the one that's 100% reproducible.

I'd even say that that's the whole point: there's only one correct binary.

I'll die on the hill that if different binaries are "all correct", then none are: for me they're all useless if they're not reproducible.

And it looks like people working on entire .iso being fully bit-for-bit reproducible are willing to die on that hill too.

"Correct" does not mean "reproducible" just because you think lowly of irreproducible builds.

A binary consisting of foo.o and bar.o is correct whether foo.o was linked before bar.o or vice versa, provided that both foo.o and bar.o were compiled correctly.

See my reply to the sibling post — binary reproducibility is not the end goal. It is an important property, and I do agree that most compiler toolchains should strive for that, but e.g. it might not be a priority for, say, a JIT compiler.
Here is a very recent post from the Go team on things they had to do to make the Go toolchain fully reproducible.

https://go.dev/blog/rebuild

Sometimes it's randomized algorithms, sometimes it's performance (e.g. it might be faster not to sort something), sometimes it's time or environment-dependent metadata, sometimes it's thread interleaving, etc.
a very common one is pointer values being different from run to run and across different operating systems. Any code that intentionally or accidentally relies on pointer values will be non-deterministic
Would be nice if you could explain how/why this happens, given that normally, pointers aren't persisted.
Languages such as Standard ML and others (Scheme? Lisp? Not sure...) have implementations that can save the current state of the heap into a binary.

This is used in theorem provers, for example, so that you don't have to verify proofs of theorems over and over again (which can be very slow).

Instead, you verify them once, save the state of the heap to disk (as a binary ELF, for instance) and then you can run the binary to continue exactly where you left off (i.e. with all the interesting theorems already in memory, in a proved state).

This is what the HOL4 theorem prover's main `hol` script does, i.e. it runs HOL4 by loading such a memory state from disk, with the core theories and theorems already loaded.

Presumably, to make this reproducible you'd need to make sure that all the memory objects are saved to disk in a deterministic order somehow (e.g. not in memory address order, as it can change from run to run, especially when using multiple threads).

Edit: Presumably you'd also need to make sure that you persist the heap when all threads are idle and in a known state (e.g. with all timers stopped), to avoid random stack states and extraneous temporary allocations from being persisted, which would also affect the resulting binary.

Thanks, yeah. So I guess the concrete example I would cite here is that the most natural (and most efficient?) way of persisting std::map<ptr, ....> would introduce pointer ordering into the output.
Just like the most natural (and most efficient?) way of persisting any std::unordered_map<...> can result in a completely randomly-ordered output, due to a DoS mitigation that some commonly-used language runtimes have.
I think they meant if you cast a pointer to an integer, do some math on that and then store that. Then you will a stored result that will likely differ from run to run
That sounds like runtime differences not a difference between two binaries
The difference in binaries must be caused by some runtime difference of a compiler.
That’s runtime behavior
A surprising amount of compiler and program behavior depends on how pointer values compare.

These comparisons don't have to go the same way for everything to be correct.

I don't develop enough to give a particularly good answer, but one example I've heard of involves timestamps

Imagine the program uses the current date or time as a value. When compiled at different moments, the bits change.

Same applies to anything where the build environment or timing influences the output binary

Laziness and carelessness of compiler developers.
As others have mentioned, there’s sorting issues (are directory entries created in the same order for a project that compiled everything in a directory?), timestamps (archive files and many other formats embed timestamps), and things that you really want to be random (tmpdir on Linux [at least in the past] would create directories of varying length).

I’ve successfully built tools to compare Java JARs that required getting around two of those and other test tools that required the third. I’m sure there are more.