> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.
Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.
I'm always "surprised" by the fact that CPUs care more about throughput than latency these days. A lot of CPUs (Intel, AMD, ARM, etc.) can start one or even two SIMD multiplies per clock tick, even though each multiply takes something like 5 clock ticks to complete.
I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput-wise), even though it's a big deal latency-wise.
So getting rid of those multiplies and cutting down the latency (i.e. using only adds) barely helps at all, and may make no measurable difference.
One of these days, I'll actually remember that fact, lol.
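To make the latency-vs-throughput point concrete, here's a rough sketch in C. It's a toy example of my own, not the code from the SO post: the quadratic and the names `fill_direct` / `fill_adds` are made up for illustration. The first version has multiplies but no dependency between iterations; the second has only adds, but each iteration has to wait for the previous one.

```c
#include <stddef.h>

/* Hypothetical example: fill out[i] = A*i*i + B*i + C for i in [0, n). */

/* Direct form: multiplies in every iteration, but no iteration depends on
   the one before it, so the CPU (or a vectorizer) can overlap them. */
void fill_direct(double *out, size_t n, double A, double B, double C) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = A * x * x + B * x + C;
    }
}

/* Add-only form (forward differences): no multiplies inside the loop, but
   y and d are carried from one iteration to the next, so the loop runs at
   the speed of that dependency chain. */
void fill_adds(double *out, size_t n, double A, double B, double C) {
    double y = C;         /* value at i = 0 */
    double d = A + B;     /* first difference: y(1) - y(0) */
    double dd = 2.0 * A;  /* second difference, constant for a quadratic */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += d;
        d += dd;
    }
}
```

Both fill the same table when the arithmetic is exact (e.g. small integer coefficients); with general floats the add-only version accumulates rounding error differently. Which one is faster depends on the CPU and compiler flags, so measure rather than trust my guess.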
If they were integer variables, I guess the compiler would have done that, but you can't really do that with floats because floating-point addition isn't associative: (i+A)+A is not necessarily equal to i+2*A once rounding is involved. (Of course, in this particular example, the difference doesn't matter for the programmer, but the compiler doesn't know that!)
I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?
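A small sketch of why the compiler won't do this rewrite by default, assuming ordinary IEEE 754 doubles with round-to-nearest. The helper name `adds_differ_from_multiply` is made up for this example, and the `volatile` is only there to stop the compiler from keeping extra precision between steps.

```c
#include <float.h>

/* Returns 1 if (i + a) + a and i + 2*a give different doubles, else 0. */
int adds_differ_from_multiply(double i, double a) {
    volatile double step = i + a;          /* rounded to double here */
    volatile double added_twice = step + a;
    volatile double multiplied = i + 2.0 * a;
    return added_twice != multiplied;
}
```

With `i = 1.0` and `a = DBL_EPSILON / 2`, each single add is rounded back down to 1.0, while `i + 2*a` lands exactly on the next double above 1.0, so the function returns 1. With "nice" values like `i = 1.0, a = 1.0` it returns 0. As far as I know, -ffast-math tells GCC it may ignore this kind of difference and reassociate anyway.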
> as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other
I really don't see how that helps here.
You can only calculate i+8; to calculate i+9 you depend on i+8. And you can't go in strides either, since i+16 depends on i+15, which you haven't calculated yet, unless you want to intermix the stateful and non-stateful code. I'd rather not go there.
E.g. as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.
Or in other words: i+7 can depend on i-1, no problem.
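Here's a minimal sketch of what I mean, in plain C. The recurrence `x[i] = x[i-8] + step` and the name `fill_distance8` are just made up to illustrate a dependency distance of 8; it is not the code under discussion. Element i+9 depends on i+1, not on i+8, so each group of 8 can be computed together once the previous group is done.

```c
#include <stddef.h>

/* Hypothetical recurrence: x[i] = x[i-8] + step for i >= 8.
   The caller fills x[0..7] first. Within any block of 8 consecutive
   elements nothing depends on anything else in the same block, so the
   inner loop is a candidate for 8-wide SIMD. */
void fill_distance8(float *x, size_t n, float step) {
    size_t i = 8;
    for (; i + 8 <= n; i += 8) {
        /* 8 independent lanes: each reads only from the previous block */
        for (size_t lane = 0; lane < 8; lane++)
            x[i + lane] = x[i + lane - 8] + step;
    }
    /* scalar tail for the last n % 8 elements */
    for (; i < n; i++)
        x[i] = x[i - 8] + step;
}
```

The blocks themselves still run one after another, so this doesn't remove the dependency chain, it just makes the chain 8 lanes wide. Whether a compiler actually emits SIMD for the inner loop depends on flags and target.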