| Looking at it on Godbolt, it doesn't really leverage SSE on -O3, either. You can get a reasonable grasp of whether it's using SSE effectively or not just by looking at the instruction names. mulss: multiplication of a single single-precision floating point value. mulsd: multiplication of a single double-precision floating point value. mulps: multiplication of a packed group of single-precision floating point values. mulpd: multiplication of a packed group of double-precision floating point values. If you're mostly seeing -ps suffixes only on moves and shuffles, you're looking at code that is not being vectorized. (And, actually, if you're seeing a lot of shuffles, that's also a good sign it's not well-vectorized.) Incidentally, if you're seeing unexpected -sd suffixes, those are often due to unintended conversions between float and double. They can have a noticeable effect on performance, especially if you end up calling the double versions of math functions (as they often use iterative algorithms that need more iterations to achieve double precision). I'm linking GCC output, because it's simpler to follow, but you see more or less the same struggle with Clang. https://godbolt.org/z/XtVqsU |
The code Rust generates for the naive solution mostly uses ss instructions, whereas my two attempts, one using `_mm_dp_ps` and one using `_mm_mul_ps` and `_mm_hadd_ps`, were both significantly slower even though they result in fewer instructions. I suspect the issue is that, for a single dot product, the overhead of moving data in and out of the 128-bit XMM registers costs more than it's worth.
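For anyone following along, here is a minimal sketch of the two shapes being compared. This is not the code from my branch: the function names `dot_scalar`, `dot_sse3` and `dot` are made up for illustration, and I'm assuming a 4-element `[f32; 4]` layout with the last lane set to zero.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Naive scalar dot product (hypothetical helper).
/// Compilers typically emit scalar mulss/addss instructions for this.
pub fn dot_scalar(a: [f32; 4], b: [f32; 4]) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
}

/// Hand-written variant (hypothetical helper): one packed multiply
/// followed by two horizontal adds. `_mm_hadd_ps` needs SSE3.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse3")]
pub unsafe fn dot_sse3(a: [f32; 4], b: [f32; 4]) -> f32 {
    // Unaligned loads, so the arrays need no special alignment.
    let va = _mm_loadu_ps(a.as_ptr());
    let vb = _mm_loadu_ps(b.as_ptr());
    // prod = [a0*b0, a1*b1, a2*b2, a3*b3]
    let prod = _mm_mul_ps(va, vb);
    // First hadd gives pairwise sums, second gives the full sum in every lane.
    let sum = _mm_hadd_ps(prod, prod);
    let sum = _mm_hadd_ps(sum, sum);
    // Extract the lowest lane as a scalar.
    _mm_cvtss_f32(sum)
}

/// Picks the SSE3 path when the CPU supports it, otherwise the scalar one.
pub fn dot(a: [f32; 4], b: [f32; 4]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse3") {
            // Safe to call: the required CPU feature was just checked.
            return unsafe { dot_sse3(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Note that the intrinsic version spends most of its instructions on loads, horizontal adds and the final extract, with only one actual multiply, which is consistent with the "lots of shuffles" warning sign mentioned above.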
Naive Rust version output
My handwritten version with `_mm_mul_ps` and `_mm_hadd_ps`
Intuitively it feels like my version should be faster, but it isn't. In this code I changed the struct from 3 f32 components to an array of 4 f32 elements to avoid having to create the array during the computation itself. The code also requires specific alignment to avoid segfaulting, which I guess might also have affected performance.
0: https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-...
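To illustrate the alignment point, here is a rough sketch of what I mean by a 4-element, 16-byte-aligned struct. The `Vec4` type and its `add` method are hypothetical and not taken from the branch; the only facts I'm relying on are that `_mm_load_ps` and `_mm_store_ps` require 16-byte aligned addresses and that `#[repr(align(16))]` provides that.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical 4-lane vector: x, y, z plus a padding lane kept at 0.0.
/// align(16) guarantees the address is valid for aligned SSE loads.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
pub struct Vec4 {
    pub lanes: [f32; 4],
}

impl Vec4 {
    pub fn new(x: f32, y: f32, z: f32) -> Self {
        Vec4 { lanes: [x, y, z, 0.0] }
    }

    /// Component-wise addition using aligned loads and stores on x86_64,
    /// with a plain scalar loop on other targets.
    pub fn add(self, other: Vec4) -> Vec4 {
        let mut out = Vec4::new(0.0, 0.0, 0.0);
        #[cfg(target_arch = "x86_64")]
        {
            // SSE is part of the x86_64 baseline, so no runtime check is needed.
            // The aligned load/store are only sound because Vec4 is align(16).
            unsafe {
                let a = _mm_load_ps(self.lanes.as_ptr());
                let b = _mm_load_ps(other.lanes.as_ptr());
                _mm_store_ps(out.lanes.as_mut_ptr(), _mm_add_ps(a, b));
            }
        }
        #[cfg(not(target_arch = "x86_64"))]
        {
            for i in 0..4 {
                out.lanes[i] = self.lanes[i] + other.lanes[i];
            }
        }
        out
    }
}
```

Without the `align(16)` attribute, a `[f32; 4]` is only guaranteed 4-byte alignment, and the aligned load can fault; switching to `_mm_loadu_ps` would avoid that at the cost of an unaligned access.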