Although those are AVX instructions, they are scalar. The SS suffix stands for Scalar Single-precision, so only one lane of the whole SIMD register is used.
You're absolutely right, I wasn't reading. The minimal flags that seem to get GCC 10.1 to vectorize are -O3, optionally with -mavx to go wider. Clang doesn't want to vectorize until you give it -ffast-math.
This can't really be vectorized without -ffast-math regardless. Notice that even with gcc, it's only vectorizing the multiply, and the adds are still scalar. This probably isn't that much of an improvement over the scalar code.
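To make the reason concrete: float addition rounds after every step, so regrouping a sum can change the answer, and without -ffast-math the compiler has to keep the source order. A minimal sketch (the function names and the values in the note below are my own, not from the thread):

```c
/* The same three values added with two different groupings.
   Each intermediate is stored in a float, so it is rounded to
   single precision before the next addition. */
float sum_left_first(float a, float b, float c)
{
    float t = a + b;   /* (a + b) rounded to float */
    return t + c;
}

float sum_right_first(float a, float b, float c)
{
    float t = b + c;   /* (b + c) rounded to float */
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, the first grouping gives 1 and the second gives 0, assuming plain IEEE single-precision evaluation (no -ffast-math, no x87 excess precision). A vectorized reduction changes the grouping in exactly this way, which is why the compiler wants permission first.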
-ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators); I can't get GCC to produce good code with any flags.
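For anyone wondering what "independent accumulators" buys you, here is a hand-written scalar sketch of the reassociated loop. It is only an analogue: Clang's output keeps each accumulator in a vector register, and the name `dot_4acc` is made up for illustration.

```c
#include <stddef.h>

/* Dot product with four independent partial sums. The four adds in the
   loop body do not depend on each other, so they can overlap in the
   pipeline instead of waiting on one serial chain. */
float dot_4acc(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the leftover elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

This sums in a different order than the straightforward loop, so the result can differ in the last bits. That reordering is what -ffast-math licenses the compiler to do on its own.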
For gcc you need `-funroll-loops` to unroll and `-fvariable-expansion-in-unroller` to get multiple accumulation vectors. By default, it'll only use 2 accumulation vectors. You can set it to 4 (for example) with `--param max-variable-expansions-in-unroller=4`.
`-funroll-loops` by default unrolls 8 times. Unrolling beyond the number of accumulators is wasteful for simple operations like dot products or summations, so you may want to control that with `--param max-unroll-times=4`.
Thus, the following works and will produce generally faster code than LLVM (because LLVM doesn't vectorize the remainder, giving you potentially large numbers of scalar operations):
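The exact command isn't shown in this thread, so treat the following as an assumption pieced together from the flags named above, not as the original poster's invocation. The file name `dot.c` and the function name `dot` are placeholders.

```c
/* Assumed build line, combining the flags discussed above:
 *
 *   gcc -O3 -ffast-math -mavx -funroll-loops \
 *       -fvariable-expansion-in-unroller \
 *       --param max-variable-expansions-in-unroller=4 \
 *       --param max-unroll-times=4 \
 *       -S dot.c
 *
 * -S writes dot.s so the generated loop can be inspected.
 */
#include <stddef.h>

/* Plain unit-stride dot product; the source keeps a single accumulator
   and leaves unrolling and variable expansion to the compiler. */
float dot(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Whether this actually beats Clang's output will depend on the GCC version and the target, so it is worth checking the assembly rather than trusting the flags.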
That's interesting, and I should take a closer look later. I haven't followed GCC optimizations in a long time, I'm not good with assembler, and I never remember how the dumps work. The interaction between unroll-and-jam and unroll is particularly confusing. (dot here is the unit-step loop; gfortran-10 gives a bit more info than -8, and more realistic avx targets are different, of course.)