Hacker News new | ask | show | jobs
by Remnant44 712 days ago
It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if its wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.

2 comments

>Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?

Such optimizations may be forbidden by the fact that floating point addition is not associative unless you tell the compiler not to worry about that (I believe)
> floating point addition is not associative unless you tell the compiler not to worry about that

Ah yes, fun and safe math optimizations.

Ah yes, I always forget about the rules with floats. My line of work only ever uses ints.
If this would be possible for an application, does it not make more sense to use SIMD instructions at that point?
You want to use SIMD and multiple accumulators. In fact not only you want to use as many accumulators as the number of SIMD ALUs, as SIMD operations are usually longer latency you usually unroll SIMD loops for software pipelining, using more accumulators to break loop carried dependencies.