
I'm not sure that pushing the complexity to the compiler makes as much sense.

One good side of x86-style instruction sets is that there is a lot you can do in the CPU to further optimize existing programs. While some really advanced compiler optimizations may use internal details of the implementation to choose what instruction sequence to emit, these details are not part of the ISA, and thus you can change them without breaking backwards compatibility. Changing them could slow down code that was optimized with those details in mind, but the code will still function. And I'm not even talking just about things like out-of-order execution. Some ISAs leak enough details that just moving from multi-cycle in-order execution to pipelined execution was awkward.

This ability of the implementation to abstract away from the ISA is very handy. And some RISC processors that exposed implementation details like branch-delay slots ended up learning this lesson the hard way. Now, the Itanium ISA does largely avoid leaking implementation details like the number of scalar execution units, but its design does make certain kinds of potential chip-side optimizations more complicated.

In the Itanium ISA, the compiler can specify groups of instructions that can run in parallel, specify speculative and advanced loads, and set up loop pipelining. But this is still more limited than what x86 cores can do behind the scenes. For an Itanium-style design, adding new types of optimizations generally requires new instructions and teaching the compilers how to use them, since many potential optimizations could only be added to the chip if you add back the very circuitry that you were trying to remove by placing the burden on the compiler.
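To make the "groups of instructions that can run in parallel" part concrete, here is a toy sketch of the kind of work that gets moved into the compiler. It is not how a real Itanium compiler works: the `Instr` type and `instruction_groups` function are names I made up, and the model only looks at register read-after-write and write-after-write conflicts, ignoring memory dependencies, predication, bundle templates, and functional-unit limits.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str               # label for the instruction (hypothetical)
    dst: str                # register written
    srcs: tuple[str, ...]   # registers read


def instruction_groups(program):
    """Greedily split an in-order program into dependency-free groups.

    A new group starts whenever an instruction reads or writes a register
    that was already written earlier in the current group.
    """
    groups = [[]]
    written = set()  # registers written so far in the current group
    for ins in program:
        if written & set(ins.srcs) or ins.dst in written:
            groups.append([])   # comparable to placing a "stop" here
            written = set()
        groups[-1].append(ins.name)
        written.add(ins.dst)
    return [g for g in groups if g]


program = [
    Instr("load_a", "r1", ()),
    Instr("load_b", "r2", ()),
    Instr("add", "r3", ("r1", "r2")),
    Instr("load_c", "r4", ()),
    Instr("mul", "r5", ("r3", "r4")),
]

print(instruction_groups(program))
# [['load_a', 'load_b'], ['add', 'load_c'], ['mul']]
```

An out-of-order x86 core rediscovers roughly this information in hardware every time the code runs; in the static model it is computed once and has to be encoded in the instruction stream.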

Even some of the optimizations Itanium compilers can do that mimic what x86 processors do behind the scenes can require emitting additional code, reducing the effectiveness of the instruction cache. This is not surprising. The benefit of static scheduling is that you pre-compute the things that can be pre-computed, like which instructions can run in parallel and where you can speculate. You then don't need to compute that on-die, and don't need to recompute it each and every time you run a code fragment. But obviously that information still needs to make it to the CPU, so you are trading that runtime computation for additional instruction storage cost. (I won't deny that the result could still end up more I-cache efficient than x86, because x86 is by no means the most efficient instruction encoding, especially since some opcodes that are rarely used anymore hog prime encoding real estate.)

Basically, I'm not sold on static scheduling for high-performance but general-purpose CPUs, and am especially not sold on the sort of pseudo-static scheduling used by Itanium, where you are scheduling around instructions with unknown latency that can differ from model to model. Completely static scheduling, where you must target the exact CPU you will run on and thus know all the timings (like the Mill promised), feels better to me. (But I'm not entirely sure about the install-time specialization they mention.)
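A rough way to see the unknown-latency problem: take a schedule the compiler laid out assuming a 2-cycle load, then run it on hypothetical models with different load latencies. This is only a sketch under assumptions I'm making up for illustration: an in-order core with interlocks that stalls issue until operands are ready, single-cycle ALU ops, and a `run_cycles` helper that is not modeled on any real simulator.

```python
def run_cycles(schedule, load_latency):
    """Return the cycle at which the last instruction issues.

    schedule: list of (planned_cycle, dst, srcs, is_load) tuples in program
    order, where planned_cycle is the slot the compiler chose.
    Assumes an in-order, interlocked core: a late operand delays the
    instruction and everything scheduled after it.
    """
    ready = {}  # register -> cycle at which its value becomes available
    slip = 0    # stall cycles accumulated so far
    issue = 0
    for planned, dst, srcs, is_load in schedule:
        issue = planned + slip
        need = max((ready.get(s, 0) for s in srcs), default=0)
        if need > issue:          # operand not ready yet: stall
            slip += need - issue
            issue = need
        ready[dst] = issue + (load_latency if is_load else 1)
    return issue


# Laid out by a hypothetical compiler that assumed load_latency == 2.
schedule = [
    (0, "r1", (), True),             # load r1
    (0, "r2", (), True),             # load r2
    (1, "r9", ("r7", "r8"), False),  # independent filler work
    (2, "r3", ("r1", "r2"), False),  # add, placed 2 cycles after the loads
    (3, "r4", ("r3", "r9"), False),  # mul
]

for latency in (1, 2, 5):
    print(latency, run_cycles(schedule, latency))
# 1 3
# 2 3
# 5 6
```

The code still works on every model, but on the slow model the static schedule stalls, and on the fast model it gains nothing because the consumer was already pinned two cycles after the loads. Dynamic scheduling adapts to both cases without a recompile.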

But I'm also no expert on CPU design, a hobbyist at best.