by astrange 1477 days ago
The compiler doesn’t know anything about optimizing x86 code either. The actual details there are too secret for Intel to want to accurately describe them in gcc, they’re different across different CPUs, and compilers just aren’t as good as you think they are.

(But CPUs are usually better than you think they are.)


It's not so secret, TBH. Intel's microarchitecture manuals are usually detailed enough to describe how many execution ports there are and of what type, how many stages are in the pipeline, the size of the reorder buffer, the latency of most u-ops, and any frontend hazards. The super secret stuff is things like the design of the branch predictors and memory disambiguation, the low-level tricks to optimize each of these down to the fewest gate delays (for high clock speeds, etc.), and where and how they figure out placement.
The front end (the decoder stage and branch predictors) is what would theoretically be important for compilers, as it’s the bottleneck. But Intel’s optimization advice doesn’t say much about branches anymore; they pretty much want you to rely on them to take care of it.

That’s partly secrecy and partly to give them freedom to change it. It is of course somewhat described in their patents.

There are sometimes vague hints about things to avoid, e.g. putting too many branches on the same cache line, and they usually publish the size of their tables, typically 4K or 8K entries these days? But the actual predictors are wicked devils; they are clearly doing some tournament prediction, using tiny ML modules (perceptrons), and god knows what else. I studied this carefully when trying to make good Spectre gadgets, but it is very very difficult to 100% trick (or utilize!) a branch predictor these days--they just learn in interesting ways...and entries alias :-)
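For anyone wondering what a "perceptron" predictor even looks like, here is a toy sketch in C. To be clear, this is the textbook idea only and not Intel's actual design, which isn't public. The table size, history length, threshold, modulo indexing, and all the names (`Predictor`, `predictor_train`, etc.) are assumptions made up for illustration; real hardware presumably hashes the address and combines several predictors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    TABLE_SIZE  = 64,  /* hypothetical; real tables are far larger */
    HISTORY_LEN = 8,   /* hypothetical global-history length */
    THRESHOLD   = 29   /* illustrative training threshold */
};

typedef struct {
    int8_t weights[TABLE_SIZE][HISTORY_LEN + 1]; /* [0] is the bias weight */
    int8_t history[HISTORY_LEN];                 /* +1 = taken, -1 = not taken */
} Predictor;

void predictor_init(Predictor *p)
{
    assert(p != NULL);
    memset(p->weights, 0, sizeof p->weights);
    for (int i = 0; i < HISTORY_LEN; i++)
        p->history[i] = -1;
}

/* Branches whose addresses agree modulo TABLE_SIZE share one entry: aliasing. */
static size_t entry_index(uint64_t pc)
{
    return (size_t)(pc % TABLE_SIZE);
}

/* Bias plus the dot product of the entry's weights with the global history. */
int predictor_output(const Predictor *p, uint64_t pc)
{
    const int8_t *w = p->weights[entry_index(pc)];
    int sum = w[0];
    for (int i = 0; i < HISTORY_LEN; i++)
        sum += w[i + 1] * p->history[i];
    return sum;
}

/* Predict taken when the output is non-negative. */
bool predictor_predict(const Predictor *p, uint64_t pc)
{
    return predictor_output(p, pc) >= 0;
}

/* Keep weights inside the int8_t range instead of wrapping around. */
static int8_t saturating_add(int8_t w, int delta)
{
    int v = w + delta;
    if (v > INT8_MAX) v = INT8_MAX;
    if (v < INT8_MIN) v = INT8_MIN;
    return (int8_t)v;
}

/* Update on a misprediction or a low-confidence output, then record the outcome. */
void predictor_train(Predictor *p, uint64_t pc, bool taken)
{
    int8_t *w = p->weights[entry_index(pc)];
    int out = predictor_output(p, pc);
    int t = taken ? 1 : -1;

    if ((out >= 0) != taken || abs(out) <= THRESHOLD) {
        w[0] = saturating_add(w[0], t);
        for (int i = 0; i < HISTORY_LEN; i++)
            w[i + 1] = saturating_add(w[i + 1], t * p->history[i]);
    }

    /* Shift the history by one slot and store the newest outcome at index 0. */
    memmove(&p->history[1], &p->history[0], HISTORY_LEN - 1);
    p->history[0] = (int8_t)t;
}
```

Train a never-taken branch at address 0x40 and the untrained branch at 0x80 gets predicted not-taken too, because both map to entry 0 in this toy table. That is the aliasing effect in miniature, and it hints at why tricking (or exploiting) a real predictor with many more moving parts is so hard.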

I honestly don't know if it's worth it to try to optimize branch prediction in compilers these days, beyond the obvious step of putting the highest probability target next (for fallthrough prediction) and generally laying out hot parts of the code together. TurboFan and most other dynamically-optimizing compilers put rare code at the end of functions, and that's a huge boost.
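To make the layout point concrete, here is a minimal sketch in C of how you can hand that information to the compiler yourself, assuming GCC or Clang. `__builtin_expect` and the `cold` attribute are real extensions in those compilers, but they are only hints; the `LIKELY`/`UNLIKELY` macros and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* GCC/Clang extension: tells the compiler which way a branch usually goes. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/*
 * Rare path. `cold` marks the function as unlikely to run, so the compiler
 * may optimize it for size and place it away from hot code; `noinline`
 * keeps its body out of the hot loop below.
 */
__attribute__((cold, noinline))
static long report_negative(size_t index, int value)
{
    fprintf(stderr, "negative value %d at index %zu\n", value, index);
    return -1;
}

/* Sums the values; returns -1 as soon as a negative value is seen. */
long sum_nonnegative(const int *values, size_t count)
{
    assert(values != NULL || count == 0);

    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0))
            return report_negative(i, values[i]);
        total += values[i]; /* the common case is hinted as the fallthrough */
    }
    return total;
}
```

Whether the hints change the generated code depends on the compiler version and flags, so compare the assembly before assuming a win. Profile-guided optimization gets the same information from measured runs instead of annotations.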

Would it not be in Intel's best interest to have popular compilers be able to squeeze the most performance out of its own line of CPUs?

I'm wondering how the incentives play out to keep this stuff private?

Intel sells a compiler. I've only used it briefly, a long time ago, but its code generation was well ahead of MSVC at the time even for scalar (non-SIMD) stuff, and I remember GCC was far behind too (its output performed roughly the same in microbenchmarks, but was far more bloated).
Intel has released at least some of their software suite for free (as in beer, not as in speech).

https://www.intel.com/content/www/us/en/developer/articles/n...

I think software is not a huge profit center for them.

The original comment presents:

> The actual details there are [1] too secret for Intel to want to accurately describe them in gcc, [2] they’re different across different CPUs, [3] and compilers just aren’t as good as you think they are.

2 and 3 could just be the whole story, although we haven't actually accumulated any evidence here for 3, given that the original story was about surprisingly getting beaten by a compiler despite performing a seemingly obvious optimization.

You can just look at the gcc and llvm source. GCC’s x86 backend is maintained by an Intel employee, and it doesn’t have especially detailed descriptions of any CPU in -mtune.

Actually, compiler optimizations like scheduling tend to be neutral to negative on x86 because they increase register pressure. If anything, you’d probably want to do “anti-scheduling” and hope the CPU decoder takes care of it.

https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler

Interesting that it generates suboptimal code for non-Intel processors.