| Indeed. My personal thought is that we should design a CPU where these kinds of pipelines / executions are more explicit, and then write magic compilers that can pull parallelism out of our programs into the more explicit parallel form that this new CPU would prefer. You'd still be tied to an architecture, but moving to a new architecture (e.g. 2x SIMD pipelines in the future) would be as easy as recompiling, in theory. Then I realized that I've reinvented VLIW / Intel Itanium. And that's a silly, silly place and we probably shouldn't go there again :-p -------- The MIMD (multiple-instruction, multiple-data) abilities of modern CPUs are quite amazing in any case, and it's always fun to take advantage of them. Even with a single instruction stream like in this example, it is obvious that modern CPUs have gross parallelism at the instruction level. It's a bit of a shame that these high-performance toys we write are kind of unsustainable... requiring in-depth assembly knowledge and microarchitecture-specific concepts to optimize (concepts that often become obsolete as these designs inevitably change every 5 years or so). Then again, it's probably a good idea to practice writing code at this level to remind us that the modern CPU is in fact a machine with defined performance characteristics that we can take advantage of... |
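
For readers who want something concrete to poke at, here is a minimal sketch of the instruction-level parallelism point above. It is not the example the thread is discussing; the function names `sum_single` and `sum_four` are made up for illustration. The idea is that a single accumulator forms one serial dependency chain, while several independent accumulators give an out-of-order core separate chains it may be able to overlap. Whether that actually runs faster depends on the microarchitecture and on what the compiler already does (it may unroll or vectorize either loop on its own), so treat it as something to benchmark, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add needs the result of the previous add,
   so the additions form a single serial dependency chain. */
uint64_t sum_single(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators (hypothetical unroll factor): four independent
   chains, which a superscalar core may execute in parallel. */
uint64_t sum_four(const uint32_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the leftover 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum because integer addition is associative; with floating point the reordering could change the rounding, which is one reason compilers are cautious about doing this transformation for you.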
In principle you could probably use some kind of symbolic math solving library to try to detect optimization opportunities like this in general code. In practice it just wouldn't be worth it, even at -O3, because it would add a ton of CPU and memory overhead to compilation and the optimization would be very rarely applicable in the first place.