by savant_penguin 1478 days ago
What would be the difference in power consumption from each method? (Would it be always better to multiply? If so why not multiply by one?)
4 comments

The general rule for power consumption on CPUs is to do your work quickly and then get to sleep. Propagating the clock is going to eat the bulk of your power. The mild difference between multiply and add in actual usage is in the noise (orders of magnitude smaller). The bigger penalty in this case is the inter-iteration dependency, which, vectorized or not, runs the risk of holding up the whole show due to pipelining.

As a performance rule on modern processors: avoid using the result of a calculation for as long as you reasonably can (in tight loops, that is; you don't want to fall out of cache).

Have fun threading the needle!

Usually the faster version consumes less power, as this allows the core more time in sleep. This is known as race-to-sleep or race-to-idle.
The problem here is that the additions depend on values computed in the previous iteration of the loop. The version with multiplication is faster because there are no dependencies on the previous iteration, so the CPU has more freedom in scheduling the operations.
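
To make the dependency concrete, here is a minimal sketch in C. It assumes the loop in question fills an array with a quadratic a*i*i + b*i + c; the function names fill_mul and fill_add and the parameter names are made up for illustration and are not taken from the article.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply version: each element depends only on the loop index i,
 * so iterations are independent and can overlap in the pipeline. */
void fill_mul(double *out, size_t n, double a, double b, double c) {
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = a * x * x + b * x + c;
    }
}

/* Add-only version (finite differences): fewer and cheaper operations,
 * but y and d are carried from one iteration to the next, so each
 * iteration must wait for the previous one to finish its additions. */
void fill_add(double *out, size_t n, double a, double b, double c) {
    assert(out != NULL || n == 0);
    double y = c;          /* f(0) */
    double d = a + b;      /* f(1) - f(0) */
    double dd = 2.0 * a;   /* constant second difference */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += d;            /* depends on the previous y and d */
        d += dd;           /* depends on the previous d */
    }
}
```

Both functions produce the same values for small integer-valued inputs; with arbitrary doubles the add-only version can also accumulate rounding error over many iterations. Which one is actually faster depends on the compiler and the core, so measure before trusting either.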

The power consumption is a good question.

Scheduling plays a part, but it is definitely more about vectorization.
It's almost certainly more about scheduling than vectorization. The data dependencies are going to constantly stall the CPU pipeline, so it's just not able to retire instructions very quickly. The SIMD part is almost certainly a red herring. It's helping, but it's far from why it's so much faster. Tiger Lake can retire 4 plain ol' ADD operations per clock[1] - you don't need SIMD / vectorization to get instruction level parallelism. But you do need to ensure there are no data dependencies. The data dependency here is the 90% cost. The SIMD is just the cherry on top.

1: https://www.agner.org/optimize/instruction_tables.pdf
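
A rough sketch of what instruction-level parallelism without SIMD can look like, in plain scalar C. This is a generic array-sum illustration, not the loop from the article; sum_chain and sum_split4 are hypothetical names, and the choice of four accumulators is an assumption for the example rather than a tuned value.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add waits on the result of the previous add,
 * so the loop runs at the latency of a single dependent add chain. */
double sum_chain(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Four independent accumulators: the four adds in each iteration do not
 * depend on each other, so the core is free to overlap them. */
double sum_split4(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) {   /* leftover elements when n is not a multiple of 4 */
        s0 += x[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The split version changes the order of the floating-point additions, so its result can differ from sum_chain in the last bits. That is also why a compiler will generally not do this transformation for you unless you allow it to reassociate floating-point math.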

In this case since the slower one is spending most of its time stalled out, the faster one will probably be more power efficient. This isn't an AVX512 power-gobbling monstrosity issue.