by adr_ 2201 days ago
Clang 11 hasn't been released yet, right?

Right. But we've also observed non-determinism / undefined behavior in Clang 10:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=246630#c26

  ==120363== Conditional jump or move depends on uninitialised value(s)
  ==120363==    at 0x1634474: llvm::ConstantExpr::getGetElementPtr(llvm::Type*, llvm::Constant*, llvm::ArrayRef<llvm::Value*>, bool, llvm::Optional<unsigned int>, llvm::Type*) (Constants.cpp:2191)
  ==120363==    by 0x112D6D9: getGetElementPtr (Constants.h:1163)
  ==120363==    by 0x112D6D9: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1005)
  ==120363==    by 0x112DF70: (anonymous namespace)::ConstantFoldInstOperandsImpl(llvm::Value const*, unsigned int, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1039)
  ==120363==    by 0x112C165: (anonymous namespace)::ConstantFoldConstantImpl(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::SmallDenseMap<llvm::Constant*, llvm::Constant*, 4u, llvm::DenseMapInfo<llvm::Constant*>, llvm::detail::DenseMapPair<llvm::Constant*, llvm::Constant*> >&) [clone .part.0] (ConstantFolding.cpp:1114)
  ==120363==    by 0x112C5CF: llvm::ConstantFoldConstant(llvm::Constant const*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.cpp:1194)
  ==120363==    by 0x188F410: prepareICWorklistFromFunction (InstructionCombining.cpp:3584)
  ==120363==    by 0x188F410: combineInstructionsOverFunction(llvm::Function&, llvm::InstCombineWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::ProfileSummaryInfo*, unsigned int, llvm::LoopInfo*) (InstructionCombining.cpp:3703)
  ==120363==    by 0x189205F: runOnFunction (InstructionCombining.cpp:3789)
  ==120363==    by 0x189205F: llvm::InstructionCombiningPass::runOnFunction(llvm::Function&) (InstructionCombining.cpp:3768)
  ==120363==    by 0x16F4352: llvm::FPPassManager::runOnFunction(llvm::Function&) (LegacyPassManager.cpp:1482)
  ==120363==    by 0x16F4DE8: llvm::FPPassManager::runOnModule(llvm::Module&) (LegacyPassManager.cpp:1518)
  ==120363==    by 0x16F51A2: runOnModule (LegacyPassManager.cpp:1583)
  ==120363==    by 0x16F51A2: llvm::legacy::PassManagerImpl::run(llvm::Module&) (LegacyPassManager.cpp:1695)
  ==120363==    by 0x1FF4CFE: EmitAssembly (BackendUtil.cpp:954)
  ==120363==    by 0x1FF4CFE: clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) (BackendUtil.cpp:1677)
  ==120363==    by 0x2C471A8: clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (CodeGenAction.cpp:335)
  ==120363==  Uninitialised value was created by a stack allocation
  ==120363==    at 0x112C653: (anonymous namespace)::SymbolicallyEvaluateGEP(llvm::GEPOperator const*, llvm::ArrayRef<llvm::Constant*>, llvm::DataLayout const&, llvm::TargetLibraryInfo const*) (ConstantFolding.c
I’m curious, does clang promise to compile deterministically? (This is inspired by your bug but not directly relevant here, it seems?)
Non-deterministic compilation would be pretty bad.

First, producing different outputs from the same input raises problems for caching tools (e.g. ccache, FastBuild), which assume the compiler is a pure function of its inputs.
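
To make the "pure function" assumption concrete, here is a minimal Python sketch of how a compilation cache might key its entries. `cache_key` and its three fields are hypothetical and chosen for illustration; this is not the actual scheme used by ccache or FastBuild.

```python
import hashlib


def cache_key(compiler_version: str, flags: list[str], preprocessed_source: bytes) -> str:
    """Return a hex digest that identifies one compilation by its inputs only."""
    h = hashlib.sha256()
    for part in (compiler_version.encode(), "\0".join(flags).encode(), preprocessed_source):
        # Length-prefix each field so adjacent fields cannot blur into each other.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()
```

A cache built this way hands back whatever object file it stored the first time it saw a key. That is only sound if a second run of the compiler on the same inputs would have produced the same bytes.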

Moreover, assuming your code and the compiler are correct, you might still end up with a situation where the performance of the resulting binary differs depending on the planets' alignment at build time.

Worse: when the input code is wrong (which does happen when you're writing new code and trying it on your machine), you build your code locally, and you're in "luck", as the compiler generates binary code "that won't crash". So you push your modifications, and then you start getting complaints from your coworkers, because they pulled your commit and now they're getting crashes. At this moment you don't know what's happening yet, so you might even tell them "it works on my machine, did you try to rebuild all?". And this might appear to solve the issue, if this time your coworkers are "lucky"!

Finally, let's suppose your code is correct, but the compiler has a code generation bug. Have you ever tried, as a user, to diagnose a compiler bug? You spend many hours trying to minimize the input file that triggers the bug, so it's executable without needing the rest of your project, so you can send it to the compiler devs. I wouldn't even try to do this if I knew the compiler was non-deterministic.

In short, non-deterministic compilation is an invitation for trouble and confusion.

I always thought that big compilers (clang/gcc/msvc) have some degree of non-determinism due to the myriad optimization passes in combination with heuristics deciding when and where to use certain kinds of optimization. Is that true or not?

And certainly if you define deterministic compilation as always producing the same binary, this is already broken by compiler macros like "__DATE__" and randomly generated names during link-time optimization. [1]

[1] https://blog.conan.io/2019/09/02/Deterministic-builds-with-C...

It's not true. They are usually deterministic for the purpose of reproducibility. The same compiler version, flags, and sources should produce the same binary.

Yes, __DATE__ in sources can break reproducibility. That does not mean compilers get carte blanche to be nondeterministic.
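
A rough way to check this claim for a given build is to run it several times and compare hashes. The sketch below is Python; `is_reproducible` is a hypothetical helper, and `build` stands for any callable that returns the output bytes (for example, one that invokes the compiler and reads back the object file).

```python
import hashlib
from typing import Callable


def is_reproducible(build: Callable[[], bytes], runs: int = 3) -> bool:
    """Run `build` several times and report whether every output is byte-identical."""
    digests = {hashlib.sha256(build()).hexdigest() for _ in range(runs)}
    return len(digests) == 1
```

Identical hashes over a few runs don't prove determinism, but a single mismatch is enough to disprove it.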

I would think that profile-guided optimization (PGO) also makes compilation nondeterministic (unless the profile is stored for reuse.)

https://en.wikipedia.org/wiki/Profile-guided_optimization

I've been enabling PGO when building Python lately and I imagine the resulting binaries are a little different every time due to random events during profiling.

Except that gcc and clang are nondeterministic, to an extent

Sometimes the compiler needs to generate a random value and base part of the compilation on that value. Things like trying to predict which is the best branch, or things done at compile time.

A lot of work has been done to reduce the nondeterminism, and some of it can only be reduced by using things like "-frandom-seed=$your_git_commit" for example

Build determinism also goes deeper than that: for example, static libraries are archives that include the date of the archive creation, and so on.
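
To illustrate the archive timestamp point, here is a Python sketch that assumes the common `ar` layout: an 8-byte magic string, then a 60-byte text header per member with the modification time in bytes 16 to 27. `ar_member` and `normalize_mtimes` are hypothetical helpers written for this example, and the sketch ignores GNU long-name handling.

```python
AR_MAGIC = b"!<arch>\n"


def ar_member(name: str, data: bytes, mtime: int) -> bytes:
    """Build one archive member: a 60-byte header followed by the data."""
    header = (
        f"{name + '/':<16}"   # member name, '/'-terminated (GNU style)
        f"{mtime:<12}"        # modification time, decimal seconds
        f"{0:<6}{0:<6}"       # owner and group ids
        f"{'100644':<8}"      # file mode, octal
        f"{len(data):<10}"    # data size in bytes
        "`\n"                 # header terminator
    ).encode("ascii")
    padding = b"\n" if len(data) % 2 else b""  # members are 2-byte aligned
    return header + data + padding


def normalize_mtimes(archive: bytes) -> bytes:
    """Return a copy of `archive` with every member's mtime field set to 0."""
    if not archive.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    out = bytearray(archive)
    pos = len(AR_MAGIC)
    while pos + 60 <= len(out):
        out[pos + 16 : pos + 28] = f"{0:<12}".encode("ascii")
        size = int(bytes(out[pos + 48 : pos + 58]).decode("ascii"))
        pos += 60 + size + (size % 2)  # skip header, data, and alignment byte
    return bytes(out)
```

Two archives built from the same object file at different times differ only in that field, so zeroing it makes them byte-identical. GNU `ar` can do the equivalent at creation time with its `D` (deterministic) modifier.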

The simplest programs might generate the same hash, but don't expect all code to generate the same-hash binary by default

> Sometimes the compiler needs to generate a random value and base part of the compilation on that value.

As a compiler engineer with experience (among others) in LLVM and GCC this is the first time I'm hearing of this. Could you provide more details or a source?

I can't imagine where such behavior would be useful, let alone required. The only slightly plausible scenario I can think of would be representing some internal data structures as hash tables with random seeds to avoid denial-of-service attacks. But then the compiled code would still have to rely on, at some point, picking an arbitrary element out of such a hash table. I can't think of contexts inside a compiler where this would be a useful thing to do.

No experience here, but from the manpage:

-frandom-seed=string

  This option provides a seed that GCC uses in place of random numbers in generating certain symbol names that have to be different in every compiled file.  It is also used to place unique stamps in coverage data files and the object files that produce them.  You can use the -frandom-seed option to produce reproducibly identical object files.

though my example/guess on branch prediction is probably wrong

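
Since the manpage says the seed feeds symbol names that "have to be different in every compiled file", one plausible approach is to derive a stable seed per source file. The Python sketch below does that; `compile_command` is a hypothetical helper, and hashing the source path is an assumption for illustration rather than a documented recommendation.

```python
import hashlib


def compile_command(source_path: str, compiler: str = "gcc") -> list[str]:
    """Build a compile command whose -frandom-seed depends only on the source path."""
    seed = hashlib.sha256(source_path.encode()).hexdigest()[:16]
    return [compiler, "-c", source_path, f"-frandom-seed={seed}", "-o", source_path + ".o"]
```

The seed stays the same across rebuilds of one file and differs between files, which is the property the option's description asks for.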
> The simplest programs might generate the same hash, but don't expect all code to generate the same-hash binary by default

Debian and others have put quite a lot of work into reproducible software builds:

https://wiki.debian.org/ReproducibleBuilds#Even_more

This of course only works if the compiler cooperates.

The bug linked earlier is a regression in Clang 10. Clang 9 was deterministic for the same file, flags, etc.

> First, producing different outputs from the same input raises problems for caching tools (e.g. ccache, FastBuild), which assume the compiler is a pure function of its inputs.

Why should they have to? Shouldn’t they just be able to reach for any valid compilation of this particular object file and slot it in?

> Moreover, assuming your code and the compiler are correct, you might still end up with a situation where the performance of the resulting binary differs depending on the planets' alignment at build time.

This is already the case due to your environment. If you have the wrong number of environment variables you might penalize your program’s performance by a significant amount already just because you misalign the stack!

> At this moment you don't know what's happening yet, so you might even tell them "it works on my machine, did you try to rebuild all?". And this might appear to solve the issue, if this time your coworkers are "lucky"!

This sounds like the situation already with nondeterministic bugs like races, albeit with the same binary?

> Have you ever tried, as a user, to diagnose a compiler bug? You spend many hours trying to minimize the input file that triggers the bug, so it's executable without needing the rest of your project, so you can send it to the compiler devs.

I deal with nondeterministic programs all the time…they’re a bit more difficult to file bugs for, but it’s still possible.

Nope, 10.0 was just released recently.
Why is OSSFuzz using such a bleeding edge compiler? That seems a little nuts.
Wouldn't you rather catch bugs before they're released in a stable version?
It is unfair to the authors of the software that is actually tested, in this case SQLite.

You are forced to investigate, otherwise people will attribute the bug to your software.

Toolchain bugs take an amazing amount of time and energy and happen more often than people think.

Exactly. This is precisely the point of nightly builds, is it not?
Clang 11 is still in early development stages. Release date is several months away. Clang 10 was released just a couple of months ago. 11 is expected to be buggy and not fit for use yet.

The SQLite devs now have to deal with "is it or isn't it a compiler bug" nonsense, which takes their time away from fixing actual problems, working on features, etc., all because OSSFuzz decided to use a compiler that the compiler devs themselves don't think is fit for use.

How much can you trust that the fuzz results are even legitimate? They could be false positives or, worse still, false negatives.

If you're going to go down that route, I would expect them to test using both the latest stable version and whatever unstable version they want. Bugs found using the stable compiler should be reported to the project, while bugs found only using the unstable version should be reported to the compiler.
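
As a sketch, that routing rule could look like this in Python. `report_target` and its return strings are hypothetical names, not anything OSSFuzz actually exposes.

```python
def report_target(fails_on_stable: bool, fails_on_unstable: bool) -> str:
    """Decide who should receive a bug report, given results from both compilers."""
    if fails_on_stable:
        # Reproduces with the released compiler: treat it as a project bug.
        return "project"
    if fails_on_unstable:
        # Only the unreleased compiler triggers it: suspect the compiler first.
        return "compiler"
    return "none"
```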