

	by rbanffy 2669 days ago
	Any program with a build system designed in such a way that it doesn't introduce anything beyond the source code into the binary should be. If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

3 comments

rocqua 2669 days ago

> If you build the same code on two different machines, using the same compiler, with the same options, then the generated binaries should be exactly the same.

There is so much context that is normally embedded into a binary that this is usually not true unless explicit measures have been taken.

Two very common sources that introduce variability are time-stamps used in the build, and environment variables such as $HOME and $USER.
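One common mitigation for the timestamp case is the SOURCE_DATE_EPOCH convention from the Reproducible Builds project: the build reads a fixed timestamp from the environment instead of the clock. A minimal sketch of the idea in Python follows; the function names `build_timestamp` and `build_banner` and the banner format are made up for illustration, not taken from any real build system.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=None):
    """Return the timestamp to embed in the build, in seconds since the epoch.

    Prefers SOURCE_DATE_EPOCH (set by whoever drives the build) and only
    falls back to the wall clock, which makes the output non-reproducible.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())


def build_banner(environ=None):
    """Format a banner (hypothetical format) that is identical across machines
    when SOURCE_DATE_EPOCH is set. It deliberately leaves out $HOME and $USER."""
    stamp = datetime.fromtimestamp(build_timestamp(environ), tz=timezone.utc)
    return "built " + stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
```

With the variable set to the same value on two machines, the banner comes out the same on both; without it, every build embeds a different time.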


bregma 2668 days ago

If you're generating or modifying source code at build time (e.g. adding timestamps or build IDs) then you have violated the constraints on build reproducibility.


detaro 2668 days ago

If you define the problem as excluding things a large percentage of real-world build systems do by default, then it's not very interesting. The interesting part of Debian's and others' work here is making this work with small, unintrusive changes to such systems.


rbanffy 2668 days ago

As long as the intrusive changes are taken upstream, I've no problem with it.


anth_anm 2668 days ago

What if you have a multi-threaded backend to the compiler that happens to lay down data in different orders?


jabl 2668 days ago

You don't even need multi-threading. In gcc we had at least one case where a key=>value data structure was keyed by memory address, causing symbols to be emitted in different order depending on ASLR, phase of the moon, or whatever.
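The address-keyed ordering problem can be illustrated outside gcc. The Python sketch below is not gcc's code; `Symbol`, `emit_by_address` and `emit_by_name` are hypothetical names. It relies on the fact that in CPython `id()` is the object's memory address, so ordering by it ties the output to where the allocator happened to place things.

```python
class Symbol:
    """Hypothetical stand-in for a compiler's symbol record."""

    def __init__(self, name):
        self.name = name


def emit_by_address(symbols):
    """Order symbols by object identity (a memory address in CPython).

    The resulting order depends on allocation details and may differ
    between runs or machines.
    """
    return [s.name for s in sorted(symbols, key=id)]


def emit_by_name(symbols):
    """Order symbols by a key derived only from the input, so the
    result is the same on every run."""
    return [s.name for s in sorted(symbols, key=lambda s: s.name)]
```

Both functions emit the same set of names; only the second guarantees the same sequence every time.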


tantalor 2668 days ago

That's a bug.

https://en.wikipedia.org/wiki/Race_condition


lixtra 2668 days ago

Why?

Most compilers give no guarantees about the order in which they lay out data. I love deterministic processes as much as anyone. But randomized approaches have their advantages too. And if a compiler has reasons to randomize output, e.g. for speed, then it's a trade-off to consider.


anth_anm 2668 days ago

Thread finishes work

grabs lock

writes to file

writes to index

releases lock

That's not a race condition. The output order doesn't matter, but it is nondeterministic.
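The steps above can be sketched in Python. This is an illustration only: `compile_unit` is a hypothetical stand-in for real compilation, and the second function shows one possible way to get deterministic output (sorting by name before writing), not something any particular compiler is known to do.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def compile_unit(name):
    """Hypothetical stand-in for compiling one unit; returns its output bytes."""
    return ("code for " + name).encode()


def build_in_completion_order(names):
    """Each worker appends its result under a lock as it finishes.

    There is no data race, but the order of the returned list depends
    on thread scheduling.
    """
    out = []
    lock = threading.Lock()

    def work(name):
        data = compile_unit(name)
        with lock:
            out.append((name, data))

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(work, names))  # wait for every worker to finish
    return out


def build_in_stable_order(names):
    """Same parallel work, but results are sorted by name before use."""
    return sorted(build_in_completion_order(names))
```

The work still runs in parallel; only the final ordering step is serialized.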


tedunangst 2668 days ago

Why is it a bug? I write a program to download four files. I do so in parallel. Sometimes X finishes first, sometimes Y finishes first, and the files are written to disk in a different order. Why do I want to serialize this operation?


rbanffy 2668 days ago

The end result is a set of four files. You don't care about the order they are laid out on the disk and the next steps shouldn't let the order of those files influence the end result.

Let's assume there is a latent bug in the compiler that gets triggered if file four is the first one. Good luck debugging that.


aidenn0 2668 days ago

Don't forget the absolute paths of the source files...
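To show how the build directory leaks and how it can be neutralized, here is a rough Python sketch of prefix remapping, similar in spirit to GCC's `-fdebug-prefix-map=old=new` option. The function `recorded_path` and the `/build` placeholder are assumptions made for this example.

```python
import posixpath


def recorded_path(source_path, build_root, placeholder="/build"):
    """Rewrite an absolute source path so the real build root does not
    end up in the output. Uses POSIX path rules on every platform."""
    relative = posixpath.relpath(source_path, build_root)
    return posixpath.join(placeholder, relative)
```

Two users building from different home directories then record the same path for the same source file.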


anth_anm 2668 days ago

It always bugged me that that's considered part of reproducibility.

That's 100% controllable and deterministic.


aidenn0 2668 days ago

Until very recently you needed root access to do it on Linux (user namespaces can let you do it without root).


biggerfisch 2669 days ago

Until a build process starts naming things with timestamps, locales, etc. Just because the build is "source code only" doesn't mean it is deterministic.


rbanffy 2668 days ago

That's why I wrote "it doesn't introduce anything beyond the source code into the binary". Unfortunately, I forgot to emphasize the "anything".

A build process that names things with timestamps or leaks your locale into the build configuration (or doesn't pin build-time dependency versions) will make the build depend on things other than the source code (both program and build settings) you made available.

It may even be desirable for it to be non-reproducible - if, for instance, you want to use optimizations targeted to your specific system, then your build system will have to introduce the architecture information into the build process and your build will result in a unique binary that targets your own machine.


rfoo 2668 days ago

Unfortunately, if we take this definition of "anything" literally, it is impossible to build such a build system.

For example, depending on the input order, the linker may produce different output. Surely you can sort the object files, but the sorted object files order is still effectively "stored" into the binary, and that's not source code.

You can only normalize such things (as with the sorting in the example above); you cannot eliminate them, because they exist naturally.
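A small Python sketch of that normalization, assuming a build script that collects `.o` files from a directory. The helper names are hypothetical, and the `ld -o` command line is only illustrative. Note that `os.listdir` returns entries in arbitrary order, which is exactly the kind of input that needs sorting.

```python
import os


def link_command(objects, output="app"):
    """Build a linker command line with the object files in sorted order,
    so the order no longer depends on how the list was produced."""
    return ["ld", "-o", output] + sorted(objects)


def collect_objects(build_dir):
    """List the .o files in build_dir. os.listdir gives no ordering
    guarantee, so callers should pass the result through link_command."""
    return [
        os.path.join(build_dir, name)
        for name in os.listdir(build_dir)
        if name.endswith(".o")
    ]
```

The order is still an input to the link step, as noted above; sorting just makes it a fixed one.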


rbanffy 2668 days ago

> you can sort the object files, but the sorted object files order is still effectively "stored" into the binary, and that's not source code.

No, but the order should be explicitly defined in the build scripts or the result will not be deterministic.

If the order triggers, say, a linker bug that makes one in 50 builds crash, execution will not be deterministic and that's really, really bad.


vishvananda 2668 days ago

This is actually an annoying challenge of reproducible builds. In many cases it is actually useful to have a build timestamp, git sha, or build number available for debug output from the program. I've often gone as far as embedding a sha and/or timestamp into a file on export into a tgz which allows it to be reproducible from the tarfile, although builds directly out of source control would not be.


maccam94 2668 days ago

Git hashes can be inserted in reproducible builds; they are deterministic.


anth_anm 2668 days ago

Compilers haven't been built with that as a condition, so this isn't true.

It's not true in practical code either; people like to stick in timestamps.

It's not ever true on Windows, unless you use the fairly recent PE header changes.
