by cogman10
1490 days ago
> Itanium held the idea that we could accurately predict ILP at compile time (when the halting problem clearly states that we cannot).
I don't know where these notions are coming from. Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data. Further, most CPUs now-a-days are doing instruction reordering to try and extract as much instruction level parallelism out as possible. Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.
Side note: the instruction reordering actually poses a problem for parallel code. Language writers and compiler writers have to be extra careful about putting up "fences" to make sure a read or write isn't happening outside a critical section when it shouldn't be.
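For anyone following along, here is a rough sketch of what "a data dependency problem" means for straight-line code. It is a toy, not how any real compiler does it: the `(dest, sources)` instruction format and the register names are made up, every register is assumed to be written once (so only read-after-write dependencies exist), and every instruction is assumed to take one cycle.

```python
# Toy straight-line program as (dest, sources); all names are hypothetical.
program = [
    ("r1", ["a", "b"]),    # r1 = a + b
    ("r2", ["c", "d"]),    # r2 = c + d
    ("r3", ["r1", "r2"]),  # r3 = r1 * r2
    ("r4", ["e", "f"]),    # r4 = e + f
    ("r5", ["r3", "r4"]),  # r5 = r3 - r4
]


def schedule_levels(program):
    """Group instruction indices into levels.

    Instructions in the same level have no read-after-write dependency on
    each other, so (under the unit-latency assumption) they could issue
    together. Inputs that no instruction writes are ready at level 0.
    """
    ready_at = {}  # register -> first level at which its value is available
    levels = []
    for idx, (dest, srcs) in enumerate(program):
        level = max((ready_at.get(s, 0) for s in srcs), default=0)
        while len(levels) <= level:
            levels.append([])
        levels[level].append(idx)
        ready_at[dest] = level + 1
    return levels


print(schedule_levels(program))
```

The grouping only holds because the sketch assumes it knows every latency up front, which is the assumption the rest of the thread argues about.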
> Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.
Peephole optimizations are literally just rewrite rules and are very limited in what they can accomplish, and we still haven't found an even moderately reliable way to optimize larger pieces of a program. Auto-vectorization is still so bad that even unskilled devs can probably do a better job by hand.
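To make "just rewrite rules" concrete, here is a minimal sketch of a peephole pass. The tuple instruction format and the three rules are invented for illustration and are not taken from any real compiler; the only point is that each rule looks at one or two adjacent instructions and nothing else.

```python
def peephole(instrs):
    """Apply local rewrite rules repeatedly until no rule fires."""
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(instrs):
            cur = instrs[i]
            nxt = instrs[i + 1] if i + 1 < len(instrs) else None
            # Rule 1: "push X" immediately followed by "pop X" cancels out.
            if nxt and cur[0] == "push" and nxt[0] == "pop" and cur[1] == nxt[1]:
                i += 2
                changed = True
                continue
            # Rule 2: multiply by 2 becomes a shift left by 1.
            if cur[0] == "mul" and cur[2] == 2:
                out.append(("shl", cur[1], 1))
                i += 1
                changed = True
                continue
            # Rule 3: adding 0 does nothing, so drop it.
            if cur[0] == "add" and cur[2] == 0:
                i += 1
                changed = True
                continue
            out.append(cur)
            i += 1
        instrs = out
    return instrs


code = [
    ("push", "rax"),
    ("pop", "rax"),
    ("mul", "rbx", 2),
    ("add", "rcx", 0),
    ("mov", "rdx", 1),
]
print(peephole(code))
```

Nothing in the pass ever sees more than a two-instruction window, which is why it cannot discover anything about the larger shape of the program.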
> Further, most CPUs now-a-days are doing instruction reordering to try and extract as much instruction level parallelism out as possible.
This is true and proves my point rather than yours. If the compiler could do the job, then the VLIW output would be faster and would not require OoO execution. It's telling that the fastest versions of Itanium were the ones that took the incoming VLIW instructions and ripped them apart into a traditional OoO instruction window, effectively negating the whole idea while preserving the externally-facing ISA.
> Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.
If they solved it years ago, then why do we get such MASSIVE ILP boosts from bigger instruction windows? Why is 2-3 instructions of throughput the maximum efficiency we can get from in-order systems?
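A toy model of the window effect, for what it's worth. This is a sketch under heavy assumptions, not a model of any real core: `simulate`, `make_program`, the 10-cycle miss latency, and the simplified "window" (how many of the oldest un-issued instructions the scheduler may look at) are all made up for illustration. With `window=1` the core can only issue the oldest instruction, so it behaves like an in-order machine.

```python
def simulate(program, width, window):
    """Return the cycle at which the last result of `program` is available.

    program: list of (latency, deps); deps are indices of earlier instructions.
    width:   maximum instructions issued per cycle.
    window:  number of oldest un-issued instructions the scheduler may
             inspect; window=1 models in-order issue.
    """
    done_at = {}  # instruction index -> cycle its result becomes available
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        for _ in range(width):
            # Pick the oldest instruction in the window whose inputs are ready.
            ready = next(
                (
                    i
                    for i in pending[:window]
                    if all(done_at.get(d, float("inf")) <= cycle
                           for d in program[i][1])
                ),
                None,
            )
            if ready is None:
                break  # nothing in the window can issue this cycle
            done_at[ready] = cycle + program[ready][0]
            pending.remove(ready)
        cycle += 1
    return max(done_at.values(), default=0)


def make_program(load_latencies):
    """Each load is followed immediately by one single-cycle consumer."""
    prog = []
    for latency in load_latencies:
        load = len(prog)
        prog.append((latency, []))  # load, e.g. one that misses in cache
        prog.append((1, [load]))    # instruction that uses the loaded value
    return prog


prog = make_program([10, 10, 10, 10])
print("in-order, 2-wide:   ", simulate(prog, width=2, window=1))
print("8-entry window, 2-wide:", simulate(prog, width=2, window=8))
```

In this tiny case a compiler could get the same result by hoisting all four loads to the top. The argument above is that real latencies (cache hits versus misses) are not known at compile time, so the hardware window ends up doing the reordering at run time.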