|
|
|
|
|
by eksith
4629 days ago
|
|
This part stood out : "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions." And a bit later : "3.Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code." This may potentially explain why a smaller executable isn't necessarily faster when executing. I guess a lot of compiler gymnastics are devoted to breaking down complex instructions to take advantage of this. |
|
As far as executable size and performance, compiling with -Os in GCC will occasionally yield a performance increase that might even change across CPU's and architectures as the memory sub-systems hit a good rhythm or there are less misses overall. Usually smaller is better for this. -O3 will occasionally unroll gigantic loops, while using compiler directed optimization to analyze which parts of a binary can benefit overall execution from unrolling vs less misses with smaller executable size can yield even better agreement between memory subsystem performance and execution speed.
Microarchitectures like MIPS have further blind alleys such as branch-delay slots that will finish execution even if a branch instruction -before- the slots is taken. This is an out-of-order program, but putting the burden on the compiler instead of implementing the reordering in hardware actually became a nuisance because the architecture couldn't change how it expected instructions without breaking binary compatibility and the compiler wouldn't have been able to tweak for different CPU's without a fat-binary approach.