by lovasoa
2427 days ago
I liked your trick of iterating over chunks to force the compiler to vectorize the code!
Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster. See the generated assembly here: https://rust.godbolt.org/z/G5A2u0
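For readers following along, here is a rough sketch of what combining the two ideas might look like. It is not the code from the article or from the godbolt link: the function name `dot_chunked` and the lane count of 8 are assumptions made for illustration, and a dot product is used only as a stand-in reduction.

```rust
/// Hypothetical example: dot product over fixed-size chunks with `mul_add`.
/// The 8 independent accumulators give the compiler a reduction it is
/// allowed to vectorize without reordering any single accumulator's sums.
pub fn dot_chunked(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width, tune for the target

    // Work on the common prefix so both slices have the same length.
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately below.
    let a_rem = a_chunks.remainder();
    let b_rem = b_chunks.remainder();

    let mut acc = [0.0f32; LANES];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for i in 0..LANES {
            // Fused multiply-add: ca[i] * cb[i] + acc[i] with one rounding.
            acc[i] = ca[i].mul_add(cb[i], acc[i]);
        }
    }

    // Combine the lanes, then fold in the leftover elements.
    let mut sum: f32 = acc.iter().sum();
    for (&x, &y) in a_rem.iter().zip(b_rem) {
        sum = x.mul_add(y, sum);
    }
    sum
}
```

One caveat: `mul_add` only turns into a hardware FMA instruction when the target is known to support it (for example with `-C target-cpu=native` or `-C target-feature=+fma` on x86-64). Otherwise it is typically lowered to a library call, which can make things slower rather than faster.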
|
I've tried using `mul_add`, but at the moment performance isn't much better. I also noticed someone else running a big parallel build on my machine, so I'll wait a bit and then run the full sweep over the problem sizes with `mul_add`.
So it seems the presence of FMA didn't really have a performance implication, except as confirmation that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" also allows reductions to be vectorized.
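To spell out why "fast math" matters for reductions: floating-point addition is not associative, so without permission to reorder, the compiler has to keep the exact left-to-right order of a plain summation loop. A tiny illustration (the function names are made up for this example):

```rust
/// Adds three values strictly left to right: (a + b) + c.
pub fn left_to_right(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Adds the same three values with the other grouping: a + (b + c).
pub fn right_to_left(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With `a = 1e8`, `b = -1e8`, `c = 1.0`, the first grouping gives `1.0`, while the second gives `0.0`, because `-1e8 + 1.0` rounds back to `-1e8` in `f32`. A vectorized reduction changes the grouping in the same way, which is why the compiler needs either "fast math" or explicitly separate accumulators to do it.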