| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by BeeOnRope 2755 days ago

Actually unrolling is often very important. In some cases it is even more important with modern high speed out-of-order cores. For example, you might need several accumulators to handle instruction or memory latency.

For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.

It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.

So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that the compilers usually don't do it intelligently - usually loops are unrolling if some kind of threshold is met, depending on compiler options - but this always happens in kind of a feed-forward way, rather thank a feed-back way, which would involve unrolling the loop and analyzing the benefit and costs after further optimization passes.

1 comments

dragontamer 2755 days ago

Okay, manual loop unrolling definitely helps, but the programmer MUST be aware of dependency chains and ILP. The compiler cannot make the decision, at least not without a pragma or maybe an autovectorization engine.

At least, I haven't seen todays (2018) compilers cut dependency chains on without a #pragma omp reduce, or other assistance from the programmer.

I've unrolled loops myself to good sucess. But it isn't as easy as some people think it is.

Without knowing about dependency chains or ILP, small loops are often best left in smaller compact form. You leverage the branch predictor and minimize uop cache usage. I'd argue the typical program benefits from compact loops more.

link

BeeOnRope 2755 days ago

You are right, the vast majority of loops should be left without unrolling. In fact, the majority of code should be compiled for size, not for speed. For a typical application you probably have 99% of the code not being very performance sensitive, and the remaining 1% is where the time is spent. Still, people regularly compile for speed to good effect, because it's so important to compile that 1% for speed that you are better off just compiling everything that way if you aren't able/willing to profile and figure what code falls in the 1%.

That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed.

About dependency chains and ILP: of course the compiler is in the perfect places to be aware of all of this. They have detailed machine models updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often comes from compiler commits from insiders where hardware details are necessarily leaked).

A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OASCA [1], and then unroll the loop a few times and run the analysis again and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of reasons it is not so easy is that optimizations is done in layers, against a machine independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later time, possibly as late the machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't.

---

[1] https://github.com/RRZE-HPC/OSACA

link

dragontamer 2754 days ago

I had to give it some thought.

Integer-based code probably can (theoretically) be optimized by a good enough compiler I haven't thought of all overflow / underflow conditions, but most "normal" code, consisting of adds, subtracts, and AND/OR/XORs, probably can be broken up as needed by the compiler.

But floating-point code MUST execute in precisely the order that the programmer specifies. In general, (A+B)+C does NOT equal A + (B+C) for floats.

For integer-code, divison is definitely order-specific, while multiplication might be order-specific in overflow conditions (I haven't 100% thought this through yet...). But in any case, the compiler can probably split up operations to take advantage of ILP as long as the results output the same bits.

Which isn't true for some operations (ie: floating point add)

--------

In any case, the "#pragma omp reduce" gives the compiler assurance from the programmer that the ordering does not matter. I think that's why something like that could benefit from parallelism a bit better.

link

pertymcpert 2755 days ago

Just run benchmarks with -fno-unroll-loops and see what damage it does to performance.

link