Hacker News new | ask | show | jobs
by dragontamer 2755 days ago
Okay, manual loop unrolling definitely helps, but the programmer MUST be aware of dependency chains and ILP. The compiler cannot make the decision, at least not without a pragma or maybe an autovectorization engine.

At least, I haven't seen todays (2018) compilers cut dependency chains on without a #pragma omp reduce, or other assistance from the programmer.

I've unrolled loops myself to good sucess. But it isn't as easy as some people think it is.

Without knowing about dependency chains or ILP, small loops are often best left in smaller compact form. You leverage the branch predictor and minimize uop cache usage. I'd argue the typical program benefits from compact loops more.

2 comments

You are right, the vast majority of loops should be left without unrolling. In fact, the majority of code should be compiled for size, not for speed. For a typical application you probably have 99% of the code not being very performance sensitive, and the remaining 1% is where the time is spent. Still, people regularly compile for speed to good effect, because it's so important to compile that 1% for speed that you are better off just compiling everything that way if you aren't able/willing to profile and figure what code falls in the 1%.

That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed.

About dependency chains and ILP: of course the compiler is in the perfect places to be aware of all of this. They have detailed machine models updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often comes from compiler commits from insiders where hardware details are necessarily leaked).

A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OASCA [1], and then unroll the loop a few times and run the analysis again and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of reasons it is not so easy is that optimizations is done in layers, against a machine independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later time, possibly as late the machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't.

---

[1] https://github.com/RRZE-HPC/OSACA

I had to give it some thought.

Integer-based code probably can (theoretically) be optimized by a good enough compiler I haven't thought of all overflow / underflow conditions, but most "normal" code, consisting of adds, subtracts, and AND/OR/XORs, probably can be broken up as needed by the compiler.

But floating-point code MUST execute in precisely the order that the programmer specifies. In general, (A+B)+C does NOT equal A + (B+C) for floats.

For integer-code, divison is definitely order-specific, while multiplication might be order-specific in overflow conditions (I haven't 100% thought this through yet...). But in any case, the compiler can probably split up operations to take advantage of ILP as long as the results output the same bits.

Which isn't true for some operations (ie: floating point add)

--------

In any case, the "#pragma omp reduce" gives the compiler assurance from the programmer that the ordering does not matter. I think that's why something like that could benefit from parallelism a bit better.

Just run benchmarks with -fno-unroll-loops and see what damage it does to performance.