Hacker News new | ask | show | jobs
by ascar 1478 days ago
> Tldr: autovectorizer transforms the first code into SIMD instructions, but the second into 64 bit instructions.

That's not the tldr. It's actually more wrong than right.

The actual TLDR is: loop dependencies prohibit the compiler _and_ your CPU from parallelizing. The SIMD part is the compiler, the more important part here is the CPU though.

Long explanation:

Let's start with the compiler. Yes, the first version can be vectorized and you can see that in the included screenshots of the assembly output. The use of addpd [0] (add 2 pairs of 64bit doubles in simultaneously) instead of addsd (add 1 pair of 64bit doubles). Both operations have identical throughput/latency on any architecture not too ancient (e.g. check information of the corresponding intrinsic here [3]. Unfortunately Agner instruction_tables doesn't include the ADDSD for most architectures.) So while you can crunch 2 operations using SIMD at the same time, you are still doing 3 multiplications and 2 additions instead of 2 additions. The multiplications and addition have the same throughput at least on Skylake. So we can use simple math and 5/2 is still more than 2!

Now to the CPU part. The loop dependency prohibits the compiler to properly utilize instruction pipelining [4] and now differences in throughput vs latency come into play. In a pipelined scenario the CPU can work on multiple steps of the instruction cycle in parallel (from fetching the instruction, then decoding it, to executing the operation and eventually writing the result back to the register) and thus achieve higher throughput. However, if your next instruction depends on the instruction before the CPU has to finish all the steps of a pipeline from instruction start to finish before the next instruction can start. This is the latency of an instruction. For example mulsd/mulpd have 8 times higher latency than throughput on Intel Skylake.

So while SIMD comes in at a factor of 2, the loop dependency stops the CPU from realizing a potential factor of 8.

Peter Cordes also correctly points this out in his comments and answer to the question on stackoverflow.

[1] https://www.felixcloutier.com/x86/addpd

[2] https://www.felixcloutier.com/x86/addsd

[3] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

[4] https://en.wikipedia.org/wiki/Instruction_pipelining

1 comments

In short, the CPU can execute up to 6 or 8 instructions at once, but only if there are no dependencies between instructions. In the second version every operation depends on the previous result.