by ethelward 2168 days ago
Although those are AVX instructions, they are scalar. The SS suffix stands for Scalar Single-precision, so only one lane of the whole SIMD register is used.

You're absolutely right, I wasn't reading carefully. The minimal flags that seem to get GCC 10.1 to vectorize are -O3, optionally with -mavx to go wider. Clang doesn't want to vectorize until you give it -ffast-math.
This can't really be vectorized without -ffast-math regardless. Notice that even with gcc, it's only vectorizing the multiply, and the adds are still scalar. This probably isn't that much of an improvement over the scalar code.
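
To see why the compiler needs permission here: floating-point addition is not associative, and a vectorized reduction necessarily adds the terms in a different order than the scalar loop. A minimal sketch in C, with made-up function names and inputs picked to exaggerate the effect:

```c
#include <assert.h>
#include <stddef.h>

/* Sum from the first element to the last: ((x[0] + x[1]) + x[2]) + ... */
float sum_forward(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum from the last element to the first: ((x[n-1] + x[n-2]) + ...) + x[0] */
float sum_backward(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = n; i > 0; i--)
        s += x[i - 1];
    return s;
}
```

With x = {1e20f, -1e20f, 1.0f}, sum_forward returns 1 and sum_backward returns 0, because the 1.0f is absorbed when it is added to -1e20f first. Without -ffast-math the compiler has to preserve the source order, so it can't split the sum across lanes.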

-ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators); I can't get GCC to produce good code with any flags.
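
For anyone wondering what "independent accumulators" buys you, here's a rough hand-written sketch of the same reassociation in C. The name dot4 is made up, and each accumulator is a single float to keep it short, whereas in the compiler's output each one is a whole SIMD register. The point is only that the four partial sums don't depend on each other, so the adds can overlap instead of forming one long dependency chain:

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent partial sums (hypothetical helper). */
float dot4(const float *a, const float *b, size_t n) {
    assert((a != NULL && b != NULL) || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator only depends on its own previous value. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the 0 to 3 leftover elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

Note that this changes the order of the additions compared to the plain loop, so the result can differ in the last bits. That is exactly the transformation the compiler is not allowed to make on its own without -ffast-math.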

For gcc you need `-funroll-loops` to unroll and `-fvariable-expansion-in-unroller` to get multiple accumulation vectors. By default, it'll only use 2 accumulation vectors. You can set it to 4 (for example) with `--param max-variable-expansions-in-unroller=4`.

`-funroll-loops` by default unrolls 8 times. Unrolling beyond the number of accumulators is wasteful for simple operations like dot products or summations, so you may want to control that with `--param max-unroll-times=4`.

Thus, the following works and will produce generally faster code than LLVM (because LLVM doesn't vectorize the remainder, giving you potentially large numbers of scalar operations):

`-Ofast -funroll-loops --param max-unroll-times=4 -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4`

Godbolt: https://godbolt.org/z/4PXSqs
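
If you want to try the flags locally rather than on Godbolt, something like the following plain loop should do as a test input. This is my own minimal version, not necessarily what is behind the link; the file name dot.c and the function name are assumptions:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Plain dot product, left for the compiler to vectorize and unroll.
 * Assumed file name dot.c; flags copied from the comment above, plus
 * -fopt-info to report what the optimizer did:
 *
 *   gcc -c dot.c -Ofast -funroll-loops --param max-unroll-times=4 \
 *       -fvariable-expansion-in-unroller \
 *       --param max-variable-expansions-in-unroller=4 -fopt-info
 */
float dot(const float *restrict a, const float *restrict b, size_t n) {
    assert((a != NULL && b != NULL) || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

Check the -fopt-info output or the generated assembly to confirm how many accumulators you actually get; it may vary with the GCC version and the target.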

That's interesting, and I should have a closer look later; I've long been out of touch with GCC optimizations, I'm not good with assembler, and I never remember how the dumps work. The interaction between unroll-and-jam and unroll is confusing, in particular. (dot here is the unit-step loop; gfortran-10 gives a bit more info than gfortran-8, and more realistic AVX targets are different, of course.)

  $ gfortran-10  -c dot.f90 -Ofast -fopt-info  
  dot.f90:3:7: optimized: loop vectorized using 16 byte vectors
  dot.f90:3:7: optimized: loop with 2 iterations completely unrolled (header execution count 64530389)

  $ gfortran-10  -c dot.f90 -Ofast -fopt-info  -funroll-loops --param max-unroll-times=4 -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4
  dot.f90:3:7: optimized: loop vectorized using 16 byte vectors
  dot.f90:3:7: optimized: loop with 2 iterations completely unrolled (header execution count 64530389)
  dot.f90:4:0: optimized: loop unrolled 3 times