Hacker News new | ask | show | jobs
by userbinator 2758 days ago
So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.

6 comments

> In my experience, loop unrolling should basically never be done except in extremely degenerate cases

Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.

For example:

    float identity[4][4];
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            identity[y][x] = y == x ? 1 : 0;
    ... do some matrix math ...
In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.

(You might ask "who would write this code?" As Schemers say: "macros do.")

See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...

> (You might ask "who would write this code?" As Schemers say: "macros do.")

To expand on this point - in the more prosaic world of C++ - this sort of code comes about all the time in templated code. For example, the above loop you posted might have been found in something like:

``` template <unsigned N, unsigned M> class Matrix { static Matrix Identity() { ... } }

    auto m = Matrix<4, 4>::Identity();
```

The other major source of these sorts of constants leading to DCE oppportunities is inlining. Consider a more classical, matrix implementation that is not templated and doesn't lift its dimensions into the type:

``` class Matrix { unsigned n; unsigned m; static Matrix Identity(unsigned n, unsigned m) { ... } }

    // Somewhere else
    Matrix m = Matrix::Identity(4, 4);
```

Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.

If I had to make an educated guess - inlining typically generates these (i.e. partial evaluation, constant folding, an DCE) situations most often in compilers. An incredible amount of information can flow from caller to callee when you specialize the callee for that call-site.

I didn't understand dead elimination until I wrote enough macros. It is a lot easier to generate code and have the optimizer fix it than to make sure to always generate efficient code.

This is also how compilers do things, but it is only that we schemers can see the intermediate result much easier using simple source->source transformations.

As an example: I wrote a clone of racket's for loops. They use #:when and #:break clauses. Instead of generating them when they were present the break clauses just defaulted to #f and the when clauses to #t, meaning that the break clause of the generated code was just optimized away if the user didn't have any break clauses and the test for the when clauses was optimized to a regular (begin ...).

It simplified the code a lot and the optimizer was a lot faster than having to do it all myself at expansion time. I lazily just generate about 30 lines of code for a simple loop that in the end sometimes even is unrolled to the final reault due to guiles optimizer and partial evaluation.

I had a case where I was counting occurrences of 8-bit integer values. Manual loop unrolling provided a 33% speed up.
In addition to the optimisations already mentioned, loop unrolling also typically enables vectorisation in compilers. You might argue that for vectorisation it is not exactly necessary to have the relevant oerations next to each other in a continuous instruction stream, but it makes the vectorisation pass a lot nicer and simpler (if it can be called that to begin with).
Assuming you unroll just enough to fill a SIMD lane. As mentioned, in this case aggressive (16-fold) unfolding actually appears to have prevented vectorization. (A smart enough vectorizer could of course handle this but unrolling just to ”re-roll” in a later pass doesn’t sound very smart.)
Loop unrolling is useful only when it enables other optimisations. The most common is constant folding.
Fully agree.

In most cases on modern systems, small loops should remain compact as possible, to stay in the uop cache. The "for" loop overhead (the inc, cmp, and jmp instructions) effectively execute in parallel. Modern systems are highly out-of-order and the for-loop overhead is virtually nil.

Actually unrolling is often very important. In some cases it is even more important with modern high speed out-of-order cores. For example, you might need several accumulators to handle instruction or memory latency.

For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.

It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.

So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that the compilers usually don't do it intelligently - usually loops are unrolling if some kind of threshold is met, depending on compiler options - but this always happens in kind of a feed-forward way, rather thank a feed-back way, which would involve unrolling the loop and analyzing the benefit and costs after further optimization passes.

Okay, manual loop unrolling definitely helps, but the programmer MUST be aware of dependency chains and ILP. The compiler cannot make the decision, at least not without a pragma or maybe an autovectorization engine.

At least, I haven't seen todays (2018) compilers cut dependency chains on without a #pragma omp reduce, or other assistance from the programmer.

I've unrolled loops myself to good sucess. But it isn't as easy as some people think it is.

Without knowing about dependency chains or ILP, small loops are often best left in smaller compact form. You leverage the branch predictor and minimize uop cache usage. I'd argue the typical program benefits from compact loops more.

You are right, the vast majority of loops should be left without unrolling. In fact, the majority of code should be compiled for size, not for speed. For a typical application you probably have 99% of the code not being very performance sensitive, and the remaining 1% is where the time is spent. Still, people regularly compile for speed to good effect, because it's so important to compile that 1% for speed that you are better off just compiling everything that way if you aren't able/willing to profile and figure what code falls in the 1%.

That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed.

About dependency chains and ILP: of course the compiler is in the perfect places to be aware of all of this. They have detailed machine models updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often comes from compiler commits from insiders where hardware details are necessarily leaked).

A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OASCA [1], and then unroll the loop a few times and run the analysis again and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of reasons it is not so easy is that optimizations is done in layers, against a machine independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later time, possibly as late the machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't.

---

[1] https://github.com/RRZE-HPC/OSACA

I had to give it some thought.

Integer-based code probably can (theoretically) be optimized by a good enough compiler I haven't thought of all overflow / underflow conditions, but most "normal" code, consisting of adds, subtracts, and AND/OR/XORs, probably can be broken up as needed by the compiler.

But floating-point code MUST execute in precisely the order that the programmer specifies. In general, (A+B)+C does NOT equal A + (B+C) for floats.

For integer-code, divison is definitely order-specific, while multiplication might be order-specific in overflow conditions (I haven't 100% thought this through yet...). But in any case, the compiler can probably split up operations to take advantage of ILP as long as the results output the same bits.

Which isn't true for some operations (ie: floating point add)

--------

In any case, the "#pragma omp reduce" gives the compiler assurance from the programmer that the ordering does not matter. I think that's why something like that could benefit from parallelism a bit better.

Just run benchmarks with -fno-unroll-loops and see what damage it does to performance.
Well I've seen the opposite, try a naive string function of some sort (strlen, etc) now manually unroll.
Yeah, sometimes the compiler unrolls too much and innocent looking one-liner can be compiled into a monstrosity like this:

https://godbolt.org/z/aKtko5

Are there no compilers which attempt to look at that code, decide 'that looks like 1<<(num-2) when n>=2', and replace the code entirely?

There must be so many examples of bubble sort where quicksort would be better, and other code patterns which can be identified and replaced with something orders of magnitude faster.

Is that more performant?