> Sure, the code is strange, but it is not necessarily inefficient.

Out of the 6 pieces of assembly code in the article, 2 are definitely inefficient: specifically, the 2 clang ones that contain irrelevant writes to the stack. Even if a CPU were smart enough to ignore those instructions with no performance penalty (which in itself is doubtful), at the very least those instructions take up space in memory/caches unnecessarily.

The gcc output when arraySize is 3 is almost certain to be inefficient as well, when you look at portions such as:

    mov eax, 1
    test eax, eax
    sete al
    ret

All this code does is set eax to 0 and then return. It could simply be replaced with "xor eax, eax ; ret", or with "mov eax, 0 ; ret" if there's a reason to avoid "xor" (there's already a mov there). The code as present also has the side effect of changing the CPU's flags, but this side effect can't be relied on, since we return immediately and flag values are not part of the returned values with this ABI.

So yes, in general benchmarking is the only way to be sure. But when you look at the specifics of the generated code, we can see that at best 4 of the 6 snippets of assembly code are optimal, and the actual number of optimal snippets is probably lower than 4 (my best guess is 2 here). All that said, I might benchmark everything later on and post a new article about it.

> Also worth mentioning in passing: if you are not compiling with -march=native, all your code is being optimized for some prehistoric ancient least-common-denominator Intel processor, probably a 1990's-era 486, that nobody actually has anymore that has god-only-knows what inadequacies in its execution pipeline. So make sure you are.

Yep: See https://news.ycombinator.com/item?id=46978577
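
For anyone who wants to double-check the "this just returns 0" reading of the gcc sequence, here is a rough C model of it, one statement per instruction. The function names and the fake `eax` variable are made up for illustration; this only models the register value, not timing or the real flags register.

```c
#include <stdint.h>

/* Hypothetical helper: models "mov eax, 1 ; test eax, eax ; sete al ; ret"
   on a fake 32-bit register. */
uint32_t model_gcc_sequence(void) {
    uint32_t eax = 1;                          /* mov  eax, 1                       */
    int zf = ((eax & eax) == 0);               /* test eax, eax: ZF = 0, eax != 0   */
    eax = (eax & 0xFFFFFF00u) | (uint32_t)zf;  /* sete al: low byte = ZF, rest kept */
    return eax;                                /* ret: result is whatever is in eax */
}

/* Hypothetical helper: models the suggested "xor eax, eax ; ret". */
uint32_t model_xor_sequence(void) {
    uint32_t eax = 1;  /* arbitrary previous contents of the register */
    eax ^= eax;        /* xor eax, eax: always 0                      */
    return eax;        /* ret                                         */
}
```

Both functions return 0, which is the whole point: the four-instruction version and the two-instruction version leave the same value in eax.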
I will concede that the generated code looks horrible. But it is not immediately clear to me that the generated code is actually significantly worse. Has the compiler produced ugly code that, once the execution pipeline model has been applied, is actually no worse than the pretty code? Maybe. Conceivably. That question can be settled simply by profiling the code.
My personal experience with hand-optimizing code on modern Intel processors is that things one intuitively expects to improve execution speed don't necessarily produce actual performance improvements, and that one should never make such optimizations without carefully profiling the results.
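
A minimal sketch of the kind of measurement meant here, in plain C. Everything in it is an assumption for illustration: `candidate_a` and `candidate_b` are hypothetical stand-ins for two versions of the same routine, and `clock()` is a coarse CPU-time timer, so a real comparison would want many more iterations and probably a proper profiler.

```c
#include <stddef.h>
#include <time.h>

/* Two hypothetical candidates that compute the same result. */
int candidate_a(int x) { return x == 0; }
int candidate_b(int x) { return !x; }

/* Calls fn `iters` times and returns the elapsed CPU time in seconds.
   If sum_out is non-NULL, the accumulated results are stored there. */
double time_candidate(int (*fn)(int), size_t iters, long *sum_out) {
    volatile long sink = 0;  /* forces every result to be stored, so the
                                loop cannot simply be deleted */
    clock_t start = clock();
    for (size_t i = 0; i < iters; i++) {
        sink += fn((int)(i & 1u));  /* alternate inputs 0 and 1 */
    }
    clock_t end = clock();
    if (sum_out != NULL) *sum_out = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Repeats the measurement and keeps the fastest run, which tends to be
   the one least disturbed by other activity on the machine. */
double best_of(int (*fn)(int), size_t iters, int runs) {
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        double t = time_candidate(fn, iters, NULL);
        if (best < 0.0 || t < best) best = t;
    }
    return best;
}
```

Comparing `best_of(candidate_a, n, 5)` against `best_of(candidate_b, n, 5)` for a large `n` gives a first impression; it does not replace looking at what the compiler actually emitted for the loop.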
True, I suppose there is going to be a vanishingly tiny penalty for cache pollution. But if the performance of your performance-critical code section depends on how often cache misses occur, then you have a problem that's between one and two orders of magnitude worse than anything you're going to get from instruction optimization, given a 7 to 100+ cycle penalty for a cache miss, and you should probably be looking for other ways to optimize the code.
When benchmarking, cache misses are never going to happen. And I think the same is broadly true for actual performance-critical code in actual use too. If the code is truly performance-critical, then it's unlikely that cache misses will occur, or at least unlikely that a very occasional cache miss isn't interleaved with hundreds of cache hits. And if they do occur, the solution will be to optimize data access patterns, not to produce better code optimizations.
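
To make "optimize data access patterns" concrete, here is a small C sketch with two hypothetical functions that sum the same row-major matrix. They return identical results and differ only in the order in which memory is touched. How much that order matters depends on the matrix size and the machine's caches, so treat it as something to measure rather than a guaranteed win.

```c
#include <stddef.h>

/* Walks the matrix in the order it is laid out in memory:
   consecutive accesses are one element apart. */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += m[r * cols + c];
        }
    }
    return total;
}

/* Same sum, but consecutive accesses are `cols` elements apart, so on a
   large matrix each access may land on a different cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols) {
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

The instruction mix in the two inner loops is nearly identical; the difference, when there is one, comes from the memory system rather than from anything the instruction selector did.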