


	by xavierd 3909 days ago
	A lot of those optimizations would no longer yield any benefit[0]. CPU architecture has evolved a lot in 16 years, especially in branch/code prediction, to the point where a correctly predicted branch (without branch_likely) has almost no cost. [0]: At least, this is true for x86 CPUs.

3 comments

e5f34f89 3909 days ago

As a CPU architect, I can confirm that all those except possibly 2) will not yield significant benefits. Prefetching hints will only be useful when the particular code fragment is highly memory-bound because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.


fanf2 3906 days ago

My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% by adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the trie. The next fetch will (usually) be at some small offset from a pointer we can get immediately after the preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
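
To make the fetch / compute pattern concrete, here is a minimal C sketch of prefetching the base pointer before computing the offset. It is not the actual qp trie code: the `struct node` layout and the `nibble()` and `lookup()` helpers are hypothetical names made up for illustration, and `__builtin_prefetch` / `__builtin_popcount` assume GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical node layout (an assumption, NOT the real qp trie structure):
 * a branch has a bitmap of present children plus a pointer to a packed
 * array of them; a leaf has twigs == NULL and carries its key. */
struct node {
    uint16_t bitmap;      /* bit i set => a child exists for nibble i */
    struct node *twigs;   /* packed child array, NULL on leaves */
    const char *key;      /* NUL-terminated key, leaves only */
};

/* Return the 4-bit chunk of the key at the given depth (0 past the end). */
unsigned nibble(const char *key, size_t len, size_t depth)
{
    size_t byte = depth / 2;
    if (byte >= len)
        return 0;
    unsigned char c = (unsigned char)key[byte];
    return (depth % 2 == 0) ? (unsigned)(c >> 4) : (unsigned)(c & 0x0F);
}

const struct node *lookup(const struct node *n, const char *key)
{
    assert(n != NULL && key != NULL);
    size_t len = strlen(key);
    size_t depth = 0;

    while (n->twigs != NULL) {
        /* Fetch: the base pointer is known as soon as n is loaded,
         * so ask for its cache line now. */
        __builtin_prefetch(n->twigs, 0, 3);

        /* Compute: work out the exact offset while the fetch is in flight. */
        unsigned bit = 1u << nibble(key, len, depth);
        if ((n->bitmap & bit) == 0)
            return NULL;                 /* no child for this nibble */
        unsigned offset = (unsigned)__builtin_popcount(n->bitmap & (bit - 1));

        n = &n->twigs[offset];           /* next fetch: base + small offset */
        depth++;
    }
    return strcmp(n->key, key) == 0 ? n : NULL;
}
```

Whether the hint pays off depends on how much compute there is to overlap with the fetch, so it is worth measuring on the target hardware.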


tkinom 3908 days ago

If a DDR stall is 50+ CPU cycles (probably a lot more with today's 2-3 GHz CPUs), I am not sure superscalar microarchitectures would help much.

At least in my case, a network packet forwarding app, I had the profiling data to prove that it was an issue.

The app code is not that long, ~2000 lines after cleanup. But it has a lot of table lookups (DDR stalls) and branches for error-condition checks.
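
As a rough illustration of that shape of code (table lookups plus rarely-taken error branches), here is a C sketch that issues prefetches for a whole burst of packets before touching the table, and marks the error checks as unlikely. Everything in it is an assumption for illustration, not the app described above: the `route` and `packet` structs, the direct-indexed `table`, `TABLE_SIZE`, and `forward_burst()` are hypothetical, and `__builtin_prefetch` / `__builtin_expect` assume GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 4096u   /* hypothetical size; power of two so we can mask */

/* Hint that a condition is rarely true (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

struct route  { uint16_t out_port; uint16_t valid; };
struct packet { uint32_t dst; uint16_t len; uint16_t out_port; };

/* Hypothetical forwarding table, indexed by the low bits of the destination. */
struct route table[TABLE_SIZE];

uint32_t slot(uint32_t dst)
{
    return dst & (TABLE_SIZE - 1);
}

/* Process a burst of packets; returns how many were forwarded. */
size_t forward_burst(struct packet *pkts, size_t n)
{
    size_t forwarded = 0;

    /* Pass 1: start every table fetch so the misses overlap each other. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&table[slot(pkts[i].dst)], 0, 1);

    /* Pass 2: do the real work; entries are hopefully in cache by now. */
    for (size_t i = 0; i < n; i++) {
        const struct route *r = &table[slot(pkts[i].dst)];

        if (unlikely(pkts[i].len == 0))   /* error check: malformed packet */
            continue;
        if (unlikely(!r->valid))          /* error check: no route */
            continue;

        pkts[i].out_port = r->out_port;
        forwarded++;
    }
    return forwarded;
}
```

The two-pass split is one possible design choice: it trades a second loop over the burst for the chance to have several memory requests outstanding at once.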


userbinator 3909 days ago

Ditto for prefetch instructions:

https://lwn.net/Articles/444336/

A MIPS is probably the exact opposite to modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess what member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it.)


tkinom 3908 days ago

I was trying to optimize a network app. The goal was to get to 1 million pps. At the time, on a 200 MHz CPU, one cache miss cost 50+ cycles, or 25% of the per-packet CPU budget, so prefetch helped a lot in that case.
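
The 25% figure follows from the numbers above; a small C sketch of the arithmetic, with hypothetical helper names `cycles_per_packet()` and `miss_share_percent()`:

```c
#include <assert.h>

/* Cycles available per packet at a given clock rate and packet rate. */
unsigned long cycles_per_packet(unsigned long cpu_hz, unsigned long pps)
{
    assert(pps != 0);
    return cpu_hz / pps;
}

/* Share of the per-packet budget consumed by one stall, in percent. */
unsigned long miss_share_percent(unsigned long miss_cycles, unsigned long budget)
{
    assert(budget != 0);
    return 100 * miss_cycles / budget;
}
```

With a 200 MHz clock and 1 million pps, `cycles_per_packet(200000000UL, 1000000UL)` gives 200 cycles per packet, and `miss_share_percent(50, 200)` gives 25.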


kbenson 3909 days ago

I've occasionally wondered how long it takes for highly optimized C/C++ to be surpassed by optimizing compilers, either because the hand optimizations make compiler optimization harder or because they target assumptions about CPU architecture that are no longer valid.

That is, what is the shelf life of a very low-level CPU optimization for Intel hardware?


daemin 3908 days ago

Well, while not strictly on topic, there was this talk recently on micro-optimisations:

https://www.youtube.com/watch?v=nXaxk27zwlk


kbenson 3907 days ago

It seemed to cover more of the compiler optimizations and how to do low-level benchmarking and optimizing, and didn't really address when those might become obsolete, but it was really interesting and informative. Thanks!


matt_d 3904 days ago

There was an interesting talk on this a few months ago:

- http://blog.cr.yp.to/20150314-optimizing.html

- (PDF) http://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf

Discussions:

- https://news.ycombinator.com/item?id=9202858

- https://news.ycombinator.com/item?id=9396950
