Hacker News
by mdcox 3929 days ago
I'm really curious as to what exactly you mean by this...does the same code through the same compiler not reliably produce the same binary? I know very little about actual compiler mechanics, but non-deterministic compilation seems really strange to me.
3 comments

It is super strange and has a huge number of undesirable consequences.

If you compile a "hello world" type C program, you should get the same binary when you compile it again, provided your toolchain (C library, C compiler, linker, etc.) is the same.

However, certain C macros like __DATE__ make the binary change (in this case based on the date of compilation). Additionally, details of the build environment, like your working directory and your username, sometimes get into the binary.
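To make that concrete, here is a toy model in Python rather than a real compiler. The function `toy_build` and the dates and paths passed to it are hypothetical, made up for illustration; the only point is that anything baked into the output changes its hash.

```python
import hashlib


def toy_build(source: str, build_date: str, workdir: str) -> bytes:
    """Hypothetical stand-in for a compiler: returns the 'binary' as bytes.

    The build date and working directory are baked into the output, the way
    __DATE__ or an embedded path would be in a real binary.
    """
    return f"{source}\nbuilt on {build_date} in {workdir}\n".encode("utf-8")


def digest(artifact: bytes) -> str:
    """Hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(artifact).hexdigest()


source = 'int main(void) { puts("hello world"); return 0; }'

monday = digest(toy_build(source, "Jan  5 2015", "/home/alice/hello"))
monday_again = digest(toy_build(source, "Jan  5 2015", "/home/alice/hello"))
tuesday = digest(toy_build(source, "Jan  6 2015", "/home/alice/hello"))
other_dir = digest(toy_build(source, "Jan  5 2015", "/home/bob/hello"))

print(monday == monday_again)  # identical inputs give an identical digest
print(monday == tuesday)       # only the embedded date differs
print(monday == other_dir)     # only the embedded path differs
```

Same source every time, but only the first comparison matches.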

Why is this bad?

If the build server for Debian gets hacked or if a developer's machine gets hacked (for some projects), the hackers can modify the binaries. If the program is not reproducible then there is no way to tell that something has gone amiss. If the program can be built reproducibly, someone else can build the code, produce the same binary, and validate it.

This is even scarier in the case of a "Ken Thompson" style hack, where the C compiler binary is modified so that it compiles normally but inserts backdoors in certain libraries, and also inserts its modifications whenever it is building another C compiler.

If the "Ken Thompson" style hack is ever pulled off on a Linux distro, there would be no real way to tell without analysing the binaries.

Provided your initial C compiler is good, having a chain of reproducible builds where each build produces the same binary would protect against this. Currently we are just producing random binaries and relying on trust, which is horrible.

Another case is when you have to release a maintenance fix of a product released a few years ago.

You likely don't have the old toolchain installed, so you reinstall everything, pull a VM, whatever, and then rebuild the last official version.

I would be so much more comfortable if what you just rebuilt had the same md5 as the one deployed in the field. Without that, even before applying and testing the patch, you cannot be absolutely sure you have rebuilt exactly the same application.
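A minimal sketch of that check, assuming you have both files on disk. The function names are mine, and it uses SHA-256 from Python's hashlib instead of md5; the comparison works the same way with either.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks so large binaries are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_artifact(rebuilt_path: str, deployed_path: str) -> bool:
    """True only if the two files are byte-for-byte identical (same digest)."""
    return file_digest(rebuilt_path) == file_digest(deployed_path)
```

If `same_artifact` returns False for a rebuild of the last official version, you know the rebuilt toolchain or environment differs before you even touch the patch. It only tells you that something differs, not what.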

Object code often has the date/time encoded in it, for example.

Debian is updating a lot of their compile/build scripts to make things 100% byte-for-byte identical. You can see that while there isn't a lot of work involved, it's still going to take a while to hand-update many thousands of packages:

https://wiki.debian.org/ReproducibleBuilds
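As a rough idea of what one such script fix can look like: nothing in this thread says how Debian does it, so treat this as an assumption-laden sketch. It relies on SOURCE_DATE_EPOCH, an environment variable convention from the reproducible builds effort, and the helper names are hypothetical.

```python
import os
import time


def build_timestamp(environ=None) -> int:
    """Timestamp to embed in the build output, as seconds since the Unix epoch.

    Assumption: if SOURCE_DATE_EPOCH is set, use it so that every rebuild
    embeds the same value; otherwise fall back to the current time.
    """
    env = os.environ if environ is None else environ
    value = env.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())


def build_date_string(timestamp: int) -> str:
    """Format the timestamp in UTC so the local timezone cannot leak in."""
    return time.strftime("%Y-%m-%d", time.gmtime(timestamp))


# With the variable pinned, two builds on different days embed the same date.
pinned = {"SOURCE_DATE_EPOCH": "1420416000"}
print(build_date_string(build_timestamp(pinned)))
```

The fallback to the current time keeps ordinary developer builds working; only builds that pin the variable get the deterministic date.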

It depends. Technically a compiler's output should be deterministic, but apparently there are a few where it isn't; one example is any compiler built on Roslyn[1].

Although I think he might have been pointing out the problem of reproducible builds[2].

[1] https://github.com/dotnet/roslyn/issues/372

[2] https://wiki.debian.org/ReproducibleBuilds/About

You should look at NixOS[1] and GuixSD[2].

[1]: https://nixos.org

[2]: http://www.gnu.org/software/guix

It may also be worth mentioning Baserock[1] here.

[1]: http://wiki.baserock.org/