
Sure. That's an interesting example. Does the cost of the extra instructions completely disappear in an 11 cycle branch stall that occurs when the ret instruction executes? Maybe. Or perhaps the pipeline is able to pre-fetch the return address and execute the memory fetch concurrently with the three ALU-only instructions. I don't honestly know. But I do think it's unwise to make performance claims without actually profiling the code.

I will concede that the generated code looks horrible. But it is not immediately clear to me that it is actually significantly worse. Has the compiler produced ugly code that is no worse than the pretty code once the execution pipeline model has been applied? Maybe. Conceivably. That question is easily settled by profiling the code.

My personal experience with hand optimizing code on modern Intel processors is that things that one intuitively expects to improve execution speed don't necessarily produce actual performance improvements. And that one should never make such optimizations without careful profiling of the results.
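As a rough illustration of what "careful profiling" means in practice, here is a minimal timing-harness sketch in Python. The names (`variant_a`, `variant_b`, `best_time`) are made up for the example, and Python is far too high-level to reveal pipeline effects. The point is only the discipline: check that both variants agree, run each many times, take the best of several repeats, and compare numbers rather than intuitions.

```python
import timeit


def variant_a(n):
    """Hypothetical 'pretty' version: sum of squares via a generator."""
    return sum(i * i for i in range(n))


def variant_b(n):
    """Hypothetical 'ugly' version: same result with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def best_time(func, arg, repeats=5, number=200):
    """Return the best per-call time in seconds over several repeats.

    Taking the minimum reduces noise from the scheduler and other processes.
    """
    timings = timeit.repeat(lambda: func(arg), repeat=repeats, number=number)
    return min(timings) / number


if __name__ == "__main__":
    n = 10_000
    # Only compare implementations that produce the same answer.
    assert variant_a(n) == variant_b(n)
    for f in (variant_a, variant_b):
        print(f"{f.__name__}: {best_time(f, n) * 1e6:.1f} us per call")
```

For compiled code the same idea applies, but with hardware performance counters or a cycle-accurate profiler in place of a wall-clock timer.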

True, I suppose there is going to be a vanishingly tiny penalty for cache pollution. But if the performance of your performance-critical code section depends on how often cache misses occur, then you have a problem that's between one and two orders of magnitude worse than anything you're going to get from instruction optimization, given a 7 to 100+ cycle penalty for a cache miss, and you should probably be looking for other ways to optimize the code.
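To put rough numbers on that comparison, here is a back-of-envelope sketch. It assumes one cycle for each of the three extra ALU-only instructions (an assumption, not a measurement) and uses the 7 and 100 cycle miss penalties quoted in this thread. The function names are hypothetical.

```python
def expected_miss_cycles(miss_rate, penalty_cycles):
    """Average cycles lost to cache misses per call."""
    return miss_rate * penalty_cycles


def break_even_miss_rate(extra_cycles, penalty_cycles):
    """Miss rate at which misses cost as much as the extra instructions."""
    return extra_cycles / penalty_cycles


# Assumed: three ALU-only instructions at one cycle each.
EXTRA_ALU_CYCLES = 3

if __name__ == "__main__":
    for penalty in (7, 100):
        rate = break_even_miss_rate(EXTRA_ALU_CYCLES, penalty)
        print(f"penalty {penalty:>3} cycles: break-even miss rate {rate:.1%}")
```

Under those assumptions, with a 100 cycle penalty the misses cost more than the three extra instructions once more than about 3% of calls miss.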

When benchmarking, cache misses are never going to happen. And I think the same is broadly true for actual performance-critical code in actual use too. If the code is truly performance critical, then it's unlikely that cache misses will occur. Or at least it's unlikely that a very occasional cache miss isn't interleaved with hundreds of cache hits. And if they do occur, the solution will be to optimize data access patterns, not to produce better code optimizations.