| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Remnant44 712 days ago

It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if its wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.

2 comments

xxpor 711 days ago

>Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?

link

johnbcoughlin 711 days ago

Such optimizations may be forbidden by the fact that floating point addition is not associative unless you tell the compiler not to worry about that (I believe)

link

account42 710 days ago

> floating point addition is not associative unless you tell the compiler not to worry about that

Ah yes, fun and safe math optimizations.

link

xxpor 710 days ago

Ah yes, I always forget about the rules with floats. My line of work only ever uses ints.

link

spockz 711 days ago

If this would be possible for an application, does it not make more sense to use SIMD instructions at that point?

link

gpderetta 711 days ago

You want to use SIMD and multiple accumulators. In fact not only you want to use as many accumulators as the number of SIMD ALUs, as SIMD operations are usually longer latency you usually unroll SIMD loops for software pipelining, using more accumulators to break loop carried dependencies.

link