Hacker News new | ask | show | jobs
by magicalhippo 1477 days ago
> In the post, multiplications and 2 additions are not faster than 2 additions.

Reminded me of when I tried to optimize some code using assembly on x86, maybe 8-10 years go. The compiler spit out a DIV [EDI] instruction which was inside a hot loop. I saw how I could easily rewrite it to use a register instead of the indirect memory access.

So I did, and it was significantly slower, like 10-15%... I even took care to align the loop target address and so on. If I changed my code to use the memory indirection it was measurably faster.

And with that I decided hand-optimizing x86 assembly was definitely a skill to leave behind.

2 comments

It would be interesting to test it on 486, Pentium, and Pentium Pro, as you get much different execution and caching behavior between those architectures.
You probably added an extra instruction or two to put the value in a register? The CPU can already split memory accesses into uops and cached accesses are fast, so there's no point in doing that because it'll just waste an additional register (vs. using one of the many the renamer generates) and add instructions to decode.

x86 is fundamentally a CISC; if you treat it like a RISC, it will definitely disappoint.

Not really, the compiler did the round-trip into the local variable (memory), I did it via register.

I asked around at the time and someone mentioned that I might have overtaxed certain execution ports or something like that, but yeah that just cemented my belief that x86 optimization is not my cup of tea anymore. Better to spend time learning how to write code the compiler can optimize well.

It's hard to know exactly without staring at the code and knowing the exact CPU, because nothing stands out as a red flag. If you used an extra register, maybe you caused a spill (pushed something else out of registers into memory). Maybe you made the loop code bigger and it no longer fit in the loop stream buffer (if that model had one). Maybe you hit a weird frontend decode issue and it could only decode N-1 instructions per clock in that line instead of N, and that was critical to the loop's performance. Maybe your code layout changed for another reason, or the memory layout, and you got some bad cache aliasing.

These things are knowable if you have enough curiosity and maybe masochism :)

What made me give up was when I found that an optimization for one x86 processor was an impairment on another. Your DIV example might have had a different outcome on an AMD chip, or on the next- or previous-generation part from Intel.

Nowadays, counting cycles is a game for monks who don't actually have to get anything done.

The compiler doesn’t know anything about optimizing x86 code either. The actual details there are too secret for Intel to want to accurately describe them in gcc, they’re different across different CPUs, and compilers just aren’t as good as you think they are.

(But CPUs are usually better than you think they are.)

It's not so secret, TBH. Usually the intel microarchitecture manuals are detailed enough to describe how many and what type of execution ports there are, how many stages in the pipeline, the size of the reorder buffer, latency of most u-ops, and any frontend hazards. The super secret stuff are things like the design of the branch predictors, memory disambiguation, etc, as well as the low-level tricks to optimize each of these down to the fewest gate delays (for high clockspeeds, etc), as well as where and how they figure out placement, etc.
The front end (the decoder stage and branch predictors) are what would theoretically be important for compilers as they’re the bottleneck. But Intel’s optimization advice doesn’t say much about branches anymore, they pretty much want you to rely on them to take care of it.

That’s only part secrecy and part to give them freedom to change it. It is of course somewhat described in their patents.

There are sometimes vague hints about things to avoid, e.g. putting too many branches on the same cache line, and they usually publish the size of their tables, typically 4K, 8K entries these days? But the actual predictors are wicked devils; they clearly are doing some tournament predictors, using tiny ML modules (perceptrons), and god knows what else. I studied this carefully when trying to make good Spectre gadgets, but it is very very difficult to 100% trick (or utilize!) a branch predictor these days--they just learn in interesting ways...and entries alias :-)

I honestly don't know if it's worth it to try to optimize branch prediction in compilers these days, beyond the obvious step of putting the highest probability target next (for fallthrough prediction) and generally laying out hot parts of the code together. TurboFan and most other dynamically-optimizing compilers put rare code at the end of functions, and that's a huge boost.

Would it not be in Intel's best interest to have popular compilers be able to squeeze the most performance out of its own line of CPUs?

I'm wondering how the incentives play out to keep this stuff private?

Intel sells a compiler. I've only used it briefly a long time ago, but its code generation was well ahead of MSVC at the time even for scalar (non-SIMD) stuff, and I remember GCC was far behind too (it would generate roughly the same performance in microbenchmarks, but far more bloated.)
Intel has released at least some of their software suite for free (as in beer, not as in speech).

https://www.intel.com/content/www/us/en/developer/articles/n...

I think software is not a huge profit center for them.

The original comment presents:

> The actual details there are [1] too secret for Intel to want to accurately describe them in gcc, [2] they’re different across different CPUs, [3] and compilers just aren’t as good as you think they are.

2 and 3 could just be the whole story. Although we haven't actually accumulated any evidence here for 3, given that the original story was about surprisingly getting beat by a compiler, despite performing a seemingly obvious optimization.

https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler

Interesting that it generates suboptimal code for non-Intel processors.

Cached accesses are not fast anymore. Modern chips can retire a dozen or even more instructions in the time it takes to move a word in from L1 cache.