by dragontamer 1486 days ago
> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs these days care more about throughput than latency. A lot of CPUs (Intel, AMD, ARM, etc.) can start 1 or even 2 SIMD multiplications per clock tick, even though each one takes roughly 5 clock ticks to finish.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput-wise), even though it's a big deal latency-wise.

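So getting rid of those multiplies and cutting down the latency (i.e. using only add statements) barely helps at all, maybe with no measurable difference, since the multiplies from different loop iterations are independent and overlap in the pipeline anyway.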
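To make the two shapes concrete, here's a rough sketch in C. I'm assuming the loop in question evaluates a quadratic at i = 0, 1, 2, ...; the coefficients and the names poly_mul, poly_add and poly_versions_match are made up for illustration and are not from the SO post. The first version uses multiplies, but each iteration is independent. The second uses only adds (finite differences), but each iteration depends on the one before it.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical coefficients, chosen as small integers so every value
   computed below is exactly representable in a double. */
#define COEF_A 2.0
#define COEF_B 3.0
#define COEF_C 5.0

/* "Simple" version: a few multiplies per element, but iteration i does
   not depend on iteration i-1, so the CPU can overlap many iterations. */
void poly_mul(double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = COEF_A * x * x + COEF_B * x + COEF_C;
    }
}

/* Add-only version (finite differences): no multiplies in the loop, but
   y and d are carried from one iteration to the next, so each iteration
   has to wait for the previous iteration's adds. */
void poly_add(double *out, size_t n) {
    double y = COEF_C;               /* f(0) */
    double d = COEF_A + COEF_B;      /* f(1) - f(0) */
    const double dd = 2.0 * COEF_A;  /* constant second difference */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += d;
        d += dd;
    }
}

/* Returns 1 if both versions agree on the first n values, 0 if they
   differ or if allocation fails. */
int poly_versions_match(size_t n) {
    if (n == 0) return 1;
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    int ok = (a != NULL && b != NULL);
    if (ok) {
        poly_mul(a, n);
        poly_add(b, n);
        for (size_t i = 0; i < n; i++) {
            if (a[i] != b[i]) { ok = 0; break; }
        }
    }
    free(a);
    free(b);
    return ok;
}
```

Both functions fill the same array, so timing each one over a large n (with clock() or similar) is the way to check how much the add-only version actually buys you on a given machine and compiler.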

One of these days, I'll actually remember that fact, lol.