by dragontamer
1151 days ago
Modern CPU cores can perform a multiplication and an addition every clock tick. Heck, I'd expect a modern Zen 4 core to be able to do something like 4 parallel 64-bit multiplications per clock tick on its integer pipelines, and maybe 32 parallel 32-bit multiplications per clock tick on its vector pipelines. Multiplications were slow 40 years ago, but the year 2020 called and FMAC (fused multiply-accumulate) is incredibly optimized today. You should still avoid integer division (floating-point division is commonly optimized as a reciprocal followed by a multiply). But multiplications have been really, really fast since at least 2008 or so.

-------

I'm pretty sure multiplication's latency is only about 5 clocks, but with all the out-of-order processing on modern cores, a latency of just 5 ticks is rarely the bottleneck. (A DDR4 memory load is something like 200+ cycles of latency. You shouldn't even worry about the 5 cycles of a multiplication, especially because those out-of-order cores will find some other work to run in parallel during that time.)

-----

> you lookup the value in array[i][k]

You know an L1 cache lookup these days is like 4 cycles of latency, right? And I'm pretty sure you have fewer load/store units than multiplication units. So a load/store, even to L1 cache, might use more resources than the multiply. Might. I'd have to benchmark to be sure.
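To make that last point a bit more concrete, here's a minimal C sketch, assuming the `array[i][k]` in the quote comes from an ordinary matrix multiplication over row-major `double` matrices (the function name `matmul_ikj` is hypothetical, not something from this thread). In the i-k-j loop order, `a[i][k]` is loaded once and kept in a register, but every inner-loop iteration still needs two loads and one store for a single multiply-add. So it's at least plausible that the load/store units, not the multiplier, set the pace. Whether that actually holds on a given core is exactly the thing you'd have to benchmark.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): C += A * B for row-major
 * n x n matrices of doubles, using the i-k-j loop order. */
void matmul_ikj(size_t n, const double *restrict a,
                const double *restrict b, double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            /* The array[i][k] lookup: one load, hoisted out of the inner loop. */
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                /* Per iteration: load b[k][j], load c[i][j], one multiply-add,
                 * store c[i][j]. That is 2 loads + 1 store per multiply-add,
                 * and vectorizing changes the width, not that ratio. */
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}
```

For example, with `a = {1, 2, 3, 4}`, `b = {5, 6, 7, 8}` and `c` zeroed, `matmul_ikj(2, a, b, c)` leaves `c = {19, 22, 43, 50}`.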