Hacker News new | ask | show | jobs
by pcwalton 2755 days ago
> In my experience, loop unrolling should basically never be done except in extremely degenerate cases

Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.

For example:

    float identity[4][4];
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            identity[y][x] = y == x ? 1 : 0;
    ... do some matrix math ...
In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.

(You might ask "who would write this code?" As Schemers say: "macros do.")

See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...

3 comments

> (You might ask "who would write this code?" As Schemers say: "macros do.")

To expand on this point - in the more prosaic world of C++ - this sort of code comes about all the time in templated code. For example, the above loop you posted might have been found in something like:

``` template <unsigned N, unsigned M> class Matrix { static Matrix Identity() { ... } }

    auto m = Matrix<4, 4>::Identity();
```

The other major source of these sorts of constants leading to DCE oppportunities is inlining. Consider a more classical, matrix implementation that is not templated and doesn't lift its dimensions into the type:

``` class Matrix { unsigned n; unsigned m; static Matrix Identity(unsigned n, unsigned m) { ... } }

    // Somewhere else
    Matrix m = Matrix::Identity(4, 4);
```

Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.

If I had to make an educated guess - inlining typically generates these (i.e. partial evaluation, constant folding, an DCE) situations most often in compilers. An incredible amount of information can flow from caller to callee when you specialize the callee for that call-site.

I didn't understand dead elimination until I wrote enough macros. It is a lot easier to generate code and have the optimizer fix it than to make sure to always generate efficient code.

This is also how compilers do things, but it is only that we schemers can see the intermediate result much easier using simple source->source transformations.

As an example: I wrote a clone of racket's for loops. They use #:when and #:break clauses. Instead of generating them when they were present the break clauses just defaulted to #f and the when clauses to #t, meaning that the break clause of the generated code was just optimized away if the user didn't have any break clauses and the test for the when clauses was optimized to a regular (begin ...).

It simplified the code a lot and the optimizer was a lot faster than having to do it all myself at expansion time. I lazily just generate about 30 lines of code for a simple loop that in the end sometimes even is unrolled to the final reault due to guiles optimizer and partial evaluation.

I had a case where I was counting occurrences of 8-bit integer values. Manual loop unrolling provided a 33% speed up.