Hacker News new | ask | show | jobs
by xavierd 3909 days ago
A lot of those optimizations would no longer yield any benefits[0]. The CPU archictecture evolved a lot in 16 years, especially in branch/code prediction to the point where a correctly predicted branch (without branch_likely) has almost no cost.

[0]: At least, this is true for x86 CPUs.

3 comments

As a CPU architect, I can confirm that all those except possibly 2) will not yield significant benefits. Prefetching hints will only be useful when the particular code fragment is highly memory-bound because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.
My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% by adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the true. The next fetch will (usually) be some small offset from a pointer we can get immediately after its preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
If a DDR stall is 50+ cpu cycles, (probably a lot more with today's 2, 3GHz CPU), I am not sure if superscalar microarchitectures would help too much.

At lease in my case of networking packet forwarding app, I had the profiling data to prove that was an issue.

The app code is not that long ~2000 lines of code after clean up. But it have a lot of table looks up (DDR stall) and branches for error condition checks.

Ditto for prefetch instructions:

https://lwn.net/Articles/444336/

A MIPS is probably the exact opposite to modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess what member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it.)

I was trying to optimize for a network app. The goal of trying to get to 1 million pps. At that time 200Mz CPU, 1 cache miss is 50+ cycles. or 25% of the CPU budget, prefetch helped a lot in that case.
I've occasionally wondered how long it takes highly optimized C/C++ to be surpassed by optimizing compilers due to CPU advancement and the optimizations either making compiler optimization harder, or the optimizations target assumptions about CPU architecture that are no longer valid.

That is, what is the shelf life of a very low level CPU optimization for Intel hardware.

Well while strictly not on topic, there was this talk recently on micro-optimisations.

https://www.youtube.com/watch?v=nXaxk27zwlk

While it seemed to cover more of the compiler optimizations and how to do some low level benchmarking and optimizing and wasn't really addressing when those might become obsolete, it was really interesting and informative. Thanks!