Hacker News
by eksith 4629 days ago
This part stood out: "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."

And a bit later: "3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."

This may explain why a smaller executable isn't necessarily faster to execute. I guess a lot of compiler gymnastics are devoted to breaking down complex instructions to take advantage of this.
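To make the first quote a bit more concrete, here's a rough C sketch of the same sum written two ways. The function names are mine, not from the article, and whether the second version is actually faster depends on the CPU and on what the compiler does with it.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one,
   so the adds form a single serial dependency chain. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the four adds in each iteration do not depend
   on one another, so an out-of-order core is free to overlap them. */
double sum_split(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version is more code, not less, which fits the point about smaller not necessarily being faster. It also adds the values in a different order, so for general floating-point input the result can differ in the last bits.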

3 comments

In some ways, the actual execution of code is opaque to compilers. Modern x86 processors break their instructions down further into micro-ops in the decode stage. AMD and Intel each have their own approach to this internal instruction set, and it has been deeply ingrained in every CPU since perhaps K7 for AMD and Pentium Pro for Intel. Pentium M and later the Core architecture added fusion: instead of just rearranging micro-ops, the CPU combines them into composite micro-ops that are handled as one. Fusion plus out-of-order execution basically makes the CPU act like a compiler internally for binaries. It's like a JIT runtime for binary code, implemented in hardware.

As for executable size and performance, compiling with -Os in GCC will occasionally yield a performance increase, and the gain can vary across CPUs and architectures depending on whether the memory subsystem hits a good rhythm or there are fewer cache misses overall. Usually smaller is better for this. -O3 will occasionally unroll gigantic loops. Using profile-guided optimization to work out which parts of a binary benefit from unrolling and which benefit from the fewer misses of smaller code can yield an even better match between memory subsystem performance and execution speed.
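If you want to see the size difference for yourself, something like the following works as a starting point. The file name and the function are just placeholders I made up, and what GCC actually emits for this loop depends on the version and the target.

```c
/* bench.c (hypothetical file name, used only for this illustration).
 *
 * Build it twice and compare the text section sizes:
 *   gcc -Os -c bench.c -o bench_os.o
 *   gcc -O3 -c bench.c -o bench_o3.o
 *   size bench_os.o bench_o3.o
 *
 * For a whole program, GCC's profile-guided flow is roughly:
 *   gcc -O2 -fprofile-generate prog.c -o prog
 *   ./prog                  (run on representative input)
 *   gcc -O2 -fprofile-use prog.c -o prog
 */
#include <stddef.h>

/* A simple loop of the kind that higher optimization levels
   may vectorize or unroll, and that -Os tends to keep compact. */
void saxpy(float *y, const float *x, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] += k * x[i];
}
```

Whether the bigger -O3 version wins is exactly the part you have to measure on the CPU you care about.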

Architectures like MIPS have further blind alleys such as branch-delay slots: the instruction in the slot finishes executing even if the branch instruction before it is taken. In effect the program itself is written out of order. But putting the burden on the compiler instead of implementing the reordering in hardware became a nuisance, because the architecture couldn't change how it expected instructions without breaking binary compatibility, and the compiler couldn't tune for different CPUs without a fat-binary approach.
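For anyone who hasn't run into delay slots, here's a toy interpreter that shows the effect. The three-instruction "ISA" is invented for this example and has nothing to do with real MIPS encodings. It also only handles an add in the slot, where real hardware has its own rules about what may go there.

```c
#include <stddef.h>

enum op { OP_ADDI, OP_JUMP, OP_HALT };

struct insn {
    enum op op;
    int reg;   /* OP_ADDI: register index to update */
    int imm;   /* OP_ADDI: value to add; OP_JUMP: target index */
};

/* Runs prog until OP_HALT. When delay_slot is nonzero, the instruction
   right after a jump still executes before control reaches the target. */
void run(const struct insn *prog, int *regs, int delay_slot) {
    size_t pc = 0;
    for (;;) {
        struct insn in = prog[pc];
        if (in.op == OP_HALT)
            return;
        if (in.op == OP_ADDI) {
            regs[in.reg] += in.imm;
            pc++;
            continue;
        }
        /* OP_JUMP */
        if (delay_slot) {
            struct insn slot = prog[pc + 1];
            if (slot.op == OP_ADDI)
                regs[slot.reg] += slot.imm;
        }
        pc = (size_t)in.imm;
    }
}

/* Demo program: the add at index 2 sits in the delay slot. */
static const struct insn demo[] = {
    { OP_ADDI, 0, 1 },     /* 0: r0 += 1                     */
    { OP_JUMP, 0, 4 },     /* 1: jump to index 4             */
    { OP_ADDI, 1, 10 },    /* 2: r1 += 10 (delay slot)       */
    { OP_ADDI, 2, 100 },   /* 3: r2 += 100 (always skipped)  */
    { OP_HALT, 0, 0 },     /* 4: stop                        */
};
```

Run `demo` with `delay_slot` set and r1 gets updated even though the jump was taken. Run it with `delay_slot` cleared and it doesn't. The same binary means two different things depending on the pipeline behavior baked into the architecture, which is the compatibility trap.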

Depends on the app, and your use case, and the CPU, etc. YMMV.

For a long time though, the Linux kernel has been compiled to optimize for code size rather than 'performance' (according to GCC). Why? Because the kernel gets involved in every syscall a program makes, so kernel code gets brought into cache and evicted again very frequently. Loading a little less code from RAM means everything goes faster.

Well, it's going to be faster if the smaller executable can keep its entire text segment in memory.

I've done the instruction scheduling stuff by hand on paper; it's pretty interesting. We did Tomasulo scheduling, which is hardly modern, having been developed in 1967, but it'll execute your instructions in all sorts of orders.
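If you want a feel for the paper exercise, here's a much-simplified sketch in C. To be clear, this is not Tomasulo's algorithm: there are no reservation stations or common data bus, and it assumes unlimited execution units. It only captures the core idea that an instruction can start as soon as its operands are ready, regardless of program order. The struct layout, register count, and latencies are all made up for the example.

```c
#include <stddef.h>

#define NREGS 8

struct insn {
    int dst;       /* destination register */
    int src1;      /* source registers, -1 if unused */
    int src2;
    int latency;   /* cycles until the result is available */
};

/* Fills start[i] with the earliest cycle at which instruction i can begin,
   assuming unlimited execution units and register renaming, so that only
   true (read-after-write) dependencies make an instruction wait. */
void schedule(const struct insn *prog, size_t n, int *start) {
    int ready[NREGS] = {0};   /* cycle when each register's newest value is ready */
    for (size_t i = 0; i < n; i++) {
        int t = 0;
        if (prog[i].src1 >= 0 && ready[prog[i].src1] > t)
            t = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > t)
            t = ready[prog[i].src2];
        start[i] = t;
        ready[prog[i].dst] = t + prog[i].latency;
    }
}
```

Feed it a load, a multiply that uses the load, and then a second unrelated load, and the second load gets the same start cycle as the first, ahead of the multiply that precedes it in program order.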