by creato
2177 days ago
This can't really be vectorized without -ffast-math regardless. Notice that even with gcc, it's only vectorizing the multiply; the adds are still scalar, so this probably isn't much of an improvement over the scalar code. -ffast-math allows the multiply-adds to be reassociated, enabling much better approaches. Clang with -O2 -ffast-math produces good code (vfmadd132ps with 4 independent accumulators), but I can't get GCC to produce good code with any flags.
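To make the reassociation point concrete, here is a rough sketch in C of what "4 independent accumulators" means. The function names `dot_naive` and `dot_4acc` are made up for illustration, and the accumulators here are scalars; in the Clang output each one would be a whole vector register. For general inputs the two functions can differ in the last bits, because floating-point addition is not associative, which is why the compiler needs -ffast-math before it may do this rewrite itself.

```c
#include <stddef.h>

/* Straightforward loop: every add depends on the previous sum, so the
   adds form one serial chain that cannot be reordered under strict
   IEEE semantics. */
float dot_naive(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-reassociated version (illustrative only): four partial sums
   that do not depend on each other, combined once at the end. */
float dot_4acc(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    /* Scalar remainder for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The four chains can be in flight at the same time, which is what hides the latency of the add (or fused multiply-add) unit.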
`-funroll-loops` unrolls 8 times by default. Unrolling beyond the number of accumulators is wasteful for simple operations like dot products or summations, so you may want to limit that with `--param max-unroll-times=4`.
Thus, the following works and will generally produce faster code than LLVM (because LLVM doesn't vectorize the remainder, potentially leaving you with a large number of scalar operations):
`-Ofast -funroll-loops --param max-unroll-times=4 -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4`
Godbolt: https://godbolt.org/z/4PXSqs