by dragontamer
1486 days ago
> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs these days care more about throughput than latency. A lot of CPUs (Intel, AMD, ARM, etc.) can start 1 or even 2 SIMD multiplications per clock tick, even though each one takes 5 clock ticks to finish.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days throughput-wise, even though it's a big deal latency-wise. So getting rid of those multiplies and cutting down the latency (i.e., using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.
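To make the throughput-vs-latency point concrete, here's a minimal sketch in C. It is not the code from the SO post: the quadratic, the function names `poly_mul` and `poly_add`, and the use of `uint64_t` are all my own assumptions for illustration. The first version multiplies on every iteration, but the iterations are independent of each other, so the CPU can overlap them. The second uses only adds (finite differences), but each iteration needs the result of the previous one, so it forms one long dependency chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: evaluate f(i) = a*i*i + b*i + c for i = 0..n-1.
 * Unsigned arithmetic is used so that wraparound is well defined. */

/* Multiplies every iteration, but no iteration depends on another,
 * so the multiplies can be pipelined (and vectorized). */
void poly_mul(uint64_t *out, size_t n, uint64_t a, uint64_t b, uint64_t c) {
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t x = (uint64_t)i;
        out[i] = a * x * x + b * x + c;
    }
}

/* Adds only, via finite differences:
 *   f(0) = c,  f(i+1) - f(i) = a*(2*i + 1) + b,  second difference = 2*a.
 * Each iteration depends on the y and d from the previous iteration,
 * so the loop is limited by the latency of that chain. */
void poly_add(uint64_t *out, size_t n, uint64_t a, uint64_t b, uint64_t c) {
    assert(out != NULL || n == 0);
    uint64_t y = c;            /* f(0) */
    uint64_t d = a + b;        /* f(1) - f(0) */
    const uint64_t dd = 2 * a; /* constant second difference */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += d;
        d += dd;
    }
}
```

Both functions produce the same values. Which one is faster depends on the compiler, flags, and CPU, so benchmark before assuming the add-only version wins.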