| HN Mirror

Realistically there is only so much optimization a compiler can do without having execution profiles of code. There is a HUGE amount of optimization a compiler can do if it knows how likely various branches are to be taken. For example, consider a simple switch statement. There are basically two ways to do code generation: you can use a jump table, or the case can be rewritten as if it's a series of if/else statements (i.e. by generating a cascade of compare/jump instructions). And in the latter case, what order the case statements are placed by the compiler will make a huge difference in performance, as you want the most likely case statements to come first. Note that I'm just mentioning switch statements here as they're easy to understand, this type of problem is ubiquitous when it comes to code generation decisions made by the compiler.

Back to our example: if the compiler doesn't have any information about how likely different cases are then it can't actually know what is going to be fastest. So your compiler just uses some heuristics to guess and pick something that is reasonable (but is often not optimal). This problem arises all over the place in optimization passes: the compiler could generate code X or code Y, but if it's just looking at your source code it doesn't know for sure which is going to be faster, so it just chooses something based on an arbitrary heuristic. If strategy X performs better than strategy Y 90% of the time, then the compiler is still making the wrong decision 10% of the time.

This is the fundamental reason why having an extra slow "max optimization" mode doesn't work. The reason the compiler doesn't generate better code isn't necessarily because the compiler doesn't have enough time to optimize things, it's because it's working with limited information. In my experience (from C++), the compilation times with clang vary only a small amount when compiling with -O1 vs -O2 vs -O3 (maybe 5%). And in fact in clang there are many optimization passes that are not enabled by default at -O3. The reason they aren't enabled is because they usually have dubious benefit (i.e. equally as likely to improve performance as to worsen performance).

I'm not a Rust programmer, but at least in C++ land the first step to generating optimized binaries should be using -O3 and microarch flags (e.g. -march), and the next most important thing is using FDO to optimize builds. Using FDO to optimize builds will have a much bigger impact on performance than fiddling with other optimization flags. FDO works by collecting branch profiles from compiled binaries, and then feeding that information back into the compiler so it can make better code generation and layout decisions. You can gather FDO profiles on any binary (on Linux you just need to collect branch events using perf), so if you really want better optimization that's my recommendation.