by MauranKilom 2234 days ago
Without looking deeper into it, I assume that you are falling prey to dependency chains here. All your vectorized code is one big dependency chain (every instruction uses the result of the previous one in xmm0). The scalar code has more instructions but they form multiple, somewhat separate, dependency chains.

At the microarchitecture level, a core can generally have several instructions of the same kind in flight "in parallel", as long as they are independent of each other.

For example, look at 256 bit vmulps here: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

On Ivy Bridge, you can start one vmulps per cycle, but it takes 5 cycles before you get a result. If you do several vmulps (and similar) in one long dependency chain, you will only progress by one instruction every 5 cycles!
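To make the dependency-chain point concrete, here is a minimal scalar sketch in C. It is not the code from the post being discussed; the function names and the choice of four accumulators are assumptions for illustration. The first function is one long chain (every multiply waits for the previous one), while the second splits the work into four independent chains that the core can overlap.

```c
#include <stddef.h>

/* One long dependency chain: each multiply needs the previous result,
   so the loop is bound by multiply latency, not throughput. */
float product_single_chain(const float *x, size_t n) {
    float acc = 1.0f;
    for (size_t i = 0; i < n; i++) {
        acc *= x[i];
    }
    return acc;
}

/* Four independent chains: a0..a3 never depend on each other inside
   the main loop, so several multiplies can be in flight at once. */
float product_four_chains(const float *x, size_t n) {
    float a0 = 1.0f, a1 = 1.0f, a2 = 1.0f, a3 = 1.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 *= x[i];
        a1 *= x[i + 1];
        a2 *= x[i + 2];
        a3 *= x[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        a0 *= x[i];
    }
    /* Combine the chains only once, at the end. */
    return (a0 * a1) * (a2 * a3);
}
```

Note that the second version multiplies in a different order, so for general floating point inputs the result can differ in the last bits. That is also why compilers will not do this transformation for you under strict IEEE semantics; you have to write it by hand or opt in with something like -ffast-math. The same idea applies to vector code: use several accumulator registers instead of funnelling everything through xmm0.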

Another point to consider is that multithreading in a hyperthreading environment could change these results: if I'm not mistaken, the two hyperthreads sharing a core compete for the same execution ports but have separate instruction scheduling. What this means is that, in the above scenario, you could theoretically have two hyperthreads each executing one vmulps every five cycles on the same core, so that you actually get double the speed from two threads over one. However, less dependency-laden code (the scalar version?) could fully saturate the floating-point execution ports with just one thread, in which case you might not see any speed benefit at all from a second thread.
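The arithmetic behind that can be written down as a toy model. This is a deliberate oversimplification and not how any real scheduler is specified: it assumes a single execution port, identical instructions, and that the only limits are latency and issue rate. The function name and parameters are made up for illustration.

```c
/* Toy model (assumption, not a real CPU spec): sustained operations
   per cycle on one execution port.
   threads         - hyperthreads sharing the core
   chains          - independent dependency chains per thread
   latency         - cycles until a result is available (5 for the
                     Ivy Bridge vmulps example above)
   issue_per_cycle - how many such ops the port can start per cycle */
double sustained_ops_per_cycle(int threads, int chains,
                               double latency, double issue_per_cycle) {
    /* Each chain can only retire one op per `latency` cycles. */
    double latency_bound = (double)(threads * chains) / latency;
    /* The port itself caps the total, no matter how many chains exist. */
    return latency_bound < issue_per_cycle ? latency_bound : issue_per_cycle;
}
```

With latency 5 and one issue per cycle, one thread with a single chain gets 0.2 ops per cycle and two threads get 0.4, so the second hyperthread doubles throughput. With five independent chains per thread, one thread already reaches the cap of 1.0, and a second thread adds nothing in this model.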

This of course strongly depends on the hardware and on how the code is written. I'm also not confident that either of these effects is necessarily at play or the prime influence here. But if you are interested in writing well-performing code on this level, these are topics you should look into!