by cogman10 1490 days ago
> Itanium held the idea that we could accurately predict ILP at compile time (when the halting problem clearly states that we cannot).

I don't know where these notions are coming from.

Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.

Further, most CPUs nowadays are doing instruction reordering to try and extract as much instruction level parallelism out as possible.

Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.
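To make "data dependency problem" concrete, here is a minimal sketch of the static side of it. It assumes a made-up instruction format of `(dest, [sources])`, unit latency, unlimited execution units, and each register written only once (so only read-after-write dependencies matter). The function name `dependency_bundles` and the sample program are purely illustrative, not taken from any real compiler.

```python
def dependency_bundles(instrs):
    """Group instructions into bundles that could issue in the same cycle.

    instrs: list of (dest, [sources]) tuples in program order.
    Assumptions (hypothetical toy model): unit latency, unlimited
    execution units, every register is written at most once.
    Returns a list of bundles, each a list of instruction indices.
    """
    ready_at = {}   # register -> first cycle in which its value can be read
    start = []      # earliest issue cycle for each instruction
    for dest, srcs in instrs:
        # An instruction can issue once all of its sources are available.
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        start.append(cycle)
        ready_at[dest] = cycle + 1

    bundles = [[] for _ in range(max(start, default=-1) + 1)]
    for index, cycle in enumerate(start):
        bundles[cycle].append(index)
    return bundles


# Illustrative program: x, y, z are inputs that are ready from the start.
PROGRAM = [
    ("a", ["x", "y"]),  # 0: a = x + y
    ("b", ["x", "z"]),  # 1: b = x * z
    ("c", ["a", "b"]),  # 2: c = a - b
    ("d", ["y", "z"]),  # 3: d = y + z
    ("e", ["c", "d"]),  # 4: e = c * d
]

if __name__ == "__main__":
    for cycle, bundle in enumerate(dependency_bundles(PROGRAM)):
        print(f"cycle {cycle}: instructions {bundle}")
```

Under those assumptions, instructions 0, 1 and 3 share no dependencies and land in the first bundle, while 2 and 4 each have to wait for their inputs. Note that this only works because every latency is known up front.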

Side note: the instruction reordering actually poses a problem for parallel code. Language writers and compiler writers have to be extra careful about putting up "fences" to make sure a read or write isn't happening outside a critical section when it shouldn't be.


If your assertion had any weight at all, EPIC would have taken over.

> Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.

Peephole optimizations are literally just rewrite rules and very limited in what they can accomplish, but we can't find an even moderately reliable way to optimize larger bits of the program. Auto-vectorization is still so bad that even unskilled devs can probably do a better job by hand.

> Further, most CPUs nowadays are doing instruction reordering to try and extract as much instruction level parallelism out as possible.

This is true and proves my point rather than yours. If the compiler could do the job, then the VLIW output would be faster and not require OoO execution. It's telling that the fastest versions of Itanium were the ones that took the incoming VLIW commands and ripped them apart into a traditional OoO instruction window effectively negating the whole idea while preserving the externally-facing ISA.

> Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.

If they solved it years ago, then why do we get such MASSIVE ILP boosts from bigger instruction windows? Why is 2-3 instructions of throughput the maximum efficiency we can get from in-order systems?

There are a few issues with Itanium-like architectures.

The first thing to point out is that the dynamic filling of the execution units in superscalar hardware will always do no worse than whatever a pure-compiler solution can do, and will very frequently do better. Hardware can take advantage of dynamic opportunities, such as the ability to fill execution slots from code both before and after a branch (or even across function boundaries!), or being more responsive to instructions with data-dependent execution times. Yes, this does take not-insignificant amounts of hardware. But given the limitations of what compilers can statically do, it's not clear that you can put the savings to better use.
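A toy simulation can illustrate the point about data-dependent execution times. This is only a sketch under stated assumptions: the instruction format `(dest, [sources])`, the `simulate` function, the one-instruction-per-cycle issue width, and the latencies (1 cycle normally, 10 cycles for a load that misses) are all invented for illustration and do not model any real core. The in-order machine stalls at the first instruction whose operands are not ready; the out-of-order machine may pick any ready instruction among the oldest `window` unissued ones.

```python
import math


def simulate(instrs, latency, in_order, window=4, width=1):
    """Return the cycle in which the last result becomes available.

    instrs:   list of (dest, [sources]) in program order; each register is
              written once and sources are inputs or earlier destinations.
    latency:  per-instruction latency in cycles (each must be >= 1).
    in_order: if True, issue stops at the first instruction that is not ready.
    window:   how many of the oldest unissued instructions an out-of-order
              machine may consider per cycle (hypothetical parameter).
    width:    maximum instructions issued per cycle.
    """
    # Results of unissued instructions are "never ready" until issued.
    ready_at = {dest: math.inf for dest, _ in instrs}
    pending = list(range(len(instrs)))
    cycle = 0
    finish = 0
    while pending:
        candidates = pending if in_order else pending[:window]
        issued = []
        for i in candidates:
            if len(issued) == width:
                break
            _, srcs = instrs[i]
            # Registers never written by the program are inputs, ready at 0.
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                issued.append(i)
            elif in_order:
                break  # an in-order pipeline cannot skip a stalled instruction
        for i in issued:
            dest, _ = instrs[i]
            ready_at[dest] = cycle + latency[i]
            finish = max(finish, ready_at[dest])
            pending.remove(i)
        cycle += 1
    return finish


# Two independent load-then-compute chains; p1 and p2 are inputs.
PROGRAM = [
    ("v1", ["p1"]),  # 0: load v1
    ("v2", ["p2"]),  # 1: load v2
    ("a1", ["v1"]),  # 2: first use of v1
    ("a2", ["a1"]),  # 3
    ("b1", ["v2"]),  # 4: first use of v2
    ("b2", ["b1"]),  # 5
]
MISS_FIRST = [10, 1, 1, 1, 1, 1]   # load 0 misses the cache
MISS_SECOND = [1, 10, 1, 1, 1, 1]  # load 1 misses the cache

if __name__ == "__main__":
    for name, lat in (("first load misses", MISS_FIRST),
                      ("second load misses", MISS_SECOND)):
        print(name,
              "| in-order:", simulate(PROGRAM, lat, in_order=True),
              "| out-of-order:", simulate(PROGRAM, lat, in_order=False))
```

In this toy, the fixed order happens to be fine when the second load misses, but loses cycles when the first one does, because the independent chain is stuck behind the stalled instruction. Swapping the two chains at compile time just moves the problem to the other case, since the compiler cannot know which load will miss.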

The second issue is that such an arrangement usually ends up with the hardware encoding microarchitectural details into the ISA. And when you do that, and you desire to change microarchitecture, you're stuck with either changing the ISA and dealing with attendant issues, or you have to add the hardware that you're theoretically saving in the first place.

On top of this, you're stuck with practical performance being driven by the availability and adoption of sufficiently smart compilers, which is largely out of your control.

It's worth noting that you can ameliorate these issues to a larger degree if you restrict your inputs to a more structured subset of possible programs, i.e., you try to build an accelerator instead of a general-purpose CPU. And that's why you see more interesting architectures come out in the accelerator space. But for most general-purpose programs, you're not really going to do better than modern superscalar architectures, even with all the space and power they consume.

The critical difference is that EPIC (the architecture model of Itanium) essentially exposed the CPU pipelines naked to the code - so you didn't just have to reorder instructions as optimizers do today, you also had to make scheduling decisions that experience so far suggests are doable only in hardware with runtime-only data, or in very tight numerical code. This means the compiler taking the place of the branch predictor as well as OOOE scheduling, with no on-CPU instruction reordering or out-of-order retirement, and IIRC a branch mispredict was quite costly.

Moreover, EPIC pretty much meant that you couldn't apply the same chip-level IPC improvements as you could elsewhere, at least originally.

I'm not sure that branch prediction would need to go to the compiler, but definitely agree it'd likely subsume the OOOE scheduling (at very least, it'd be less effective).

That, though, seems like it might make for a good power/performance tradeoff. Those circuits aren't free. We just didn't get to the point where compilers were doing a good job of that OOOE reordering (not until after EPIC died).

The real reason, though, that Itanium died (IMO) is that most businesses insisted on emulating their x86 code at a 70% performance cost. So costly that it seems like Intel/HP spent most of their hardware engineering budget making that portion fast enough.

The x86 emulator built into Itanium 1 was very bad, yes, but it didn't matter that much outside of workstation use. HP built Itanium 2 without it, and provided software emulators for x86 and HP-PA that apparently worked "well enough".

The real deal breaker was Itanium being ridiculously expensive and quickly destroying any possibility of growing its market by pricing itself out of it - and even in the markets that had the money, it was considered overpriced (the nicest thing I heard about Itanium was "overpriced DSP masquerading as general purpose CPU"). I remember reading Intel's published roadmaps before news about amd64 landed - we would have been running 32-bit x86 much longer under them, with Itanium being kept at extra premium prices.

Even customers that had Itanium as the only upgrade path available (thanks to HP) found the performance so bad, even on natively compiled code, that they effectively forced HP to keep producing Alpha until Itanium was pretty much confirmed dead and the customers had migrated off the HP vendor-locked stack. At one of the largest mobile telcos in Poland we migrated from Alpha to IBM POWER, and many OpenVMS customers kept buying/hoarding Wildfire and Marvel architecture servers.