|
|
|
|
|
by BeeOnRope
2758 days ago
|
|
You are right, the vast majority of loops should be left without unrolling. In fact, the majority of code should be compiled for size, not for speed. For a typical application you probably have 99% of the code not being very performance sensitive, and the remaining 1% is where the time is spent. Still, people regularly compile for speed to good effect, because it's so important to compile that 1% for speed that you are better off just compiling everything that way if you aren't able/willing to profile and figure what code falls in the 1%. That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed. About dependency chains and ILP: of course the compiler is in the perfect places to be aware of all of this. They have detailed machine models updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often comes from compiler commits from insiders where hardware details are necessarily leaked). A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OASCA [1], and then unroll the loop a few times and run the analysis again and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of reasons it is not so easy is that optimizations is done in layers, against a machine independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later time, possibly as late the machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't. --- [1] https://github.com/RRZE-HPC/OSACA |
|
Integer-based code probably can (theoretically) be optimized by a good enough compiler I haven't thought of all overflow / underflow conditions, but most "normal" code, consisting of adds, subtracts, and AND/OR/XORs, probably can be broken up as needed by the compiler.
But floating-point code MUST execute in precisely the order that the programmer specifies. In general, (A+B)+C does NOT equal A + (B+C) for floats.
For integer-code, divison is definitely order-specific, while multiplication might be order-specific in overflow conditions (I haven't 100% thought this through yet...). But in any case, the compiler can probably split up operations to take advantage of ILP as long as the results output the same bits.
Which isn't true for some operations (ie: floating point add)
--------
In any case, the "#pragma omp reduce" gives the compiler assurance from the programmer that the ordering does not matter. I think that's why something like that could benefit from parallelism a bit better.