by dragontamer
2755 days ago
Okay, manual loop unrolling definitely helps, but the programmer MUST be aware of dependency chains and ILP. The compiler cannot make the decision, at least not without a pragma or maybe an autovectorization engine. At least, I haven't seen today's (2018) compilers cut dependency chains without a #pragma omp reduction clause or other assistance from the programmer. I've unrolled loops myself to good success. But it isn't as easy as some people think it is. Without knowing about dependency chains or ILP, small loops are often best left in their compact form. You leverage the branch predictor and minimize uop cache usage. I'd argue the typical program benefits more from compact loops.
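To make the dependency-chain point concrete, here is a minimal sketch in C. The function names are made up for illustration. The first loop carries a single chain through one accumulator; the second is unrolled by four with four independent accumulators, which gives the CPU separate chains it can overlap. Whether this is actually faster depends on the target and the compiler flags, so treat it as an illustration rather than a recipe.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous add, so the loop
   is limited by the latency of a single floating-point add. */
double sum_single(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with four independent accumulators: four separate
   dependency chains that can be in flight at the same time. */
double sum_unrolled4(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the elements in a different order, so for floating-point input the result can differ in the last bits. That reordering is why a compiler generally will not make this transformation on floating-point code unless it is told that reassociation is acceptable.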
|
That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed.
About dependency chains and ILP: of course the compiler is in a perfect position to be aware of all of this. Compilers have detailed machine models that are updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often come from compiler commits by insiders, where hardware details are necessarily leaked).
A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OSACA [1], and then unroll the loop a few times, run the analysis again and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of the reasons it is not so easy is that optimizations are done in layers, against a machine-independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later stage, possibly as late as machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't.
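As a rough sketch of the try-and-measure idea, here is a toy cost model in C. Everything in it is hypothetical: the struct, the function names and the two machine parameters are invented for illustration, and real tools such as IACA or OSACA model ports, uop counts and much more. The model estimates cycles per element for a reduction loop as the slower of a latency bound and a throughput bound, then keeps doubling the unroll factor only while that estimate improves.

```c
/* Hypothetical machine description for a toy model; the numbers fed in
   are assumptions, not taken from any real CPU manual. */
struct toy_machine {
    double add_latency;     /* cycles for one dependent add            */
    double adds_per_cycle;  /* peak independent adds issued per cycle  */
};

/* Estimated cycles per element for a reduction unrolled with `unroll`
   independent accumulators: the slower of the two bounds wins. */
static double toy_cycles_per_element(const struct toy_machine *m, int unroll) {
    double latency_bound = m->add_latency / unroll;
    double throughput_bound = 1.0 / m->adds_per_cycle;
    return latency_bound > throughput_bound ? latency_bound : throughput_bound;
}

/* Try unroll factors 1, 2, 4, ... up to max_unroll and stop as soon as
   the estimate no longer improves. */
static int toy_pick_unroll(const struct toy_machine *m, int max_unroll) {
    int best = 1;
    double best_cost = toy_cycles_per_element(m, 1);
    for (int u = 2; u <= max_unroll; u *= 2) {
        double cost = toy_cycles_per_element(m, u);
        if (cost >= best_cost)
            break;  /* no further gain: do not grow the loop body */
        best = u;
        best_cost = cost;
    }
    return best;
}
```

With an assumed add latency of 4 cycles and 2 adds per cycle, this model stops at an unroll factor of 8, the point where the latency bound meets the throughput bound. The hard part described above is that a real compiler often does not have numbers this clean at the point in the pipeline where it has to decide.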
---
[1] https://github.com/RRZE-HPC/OSACA