

by lovasoa 2471 days ago

It looks like what the author was looking for is [1]

    f64::mul_add(self, a: f64, b: f64) -> f64

Adding it to the code does indeed let LLVM generate "vfma" instructions. But it didn't significantly improve performance, at least on my machine.

    $ ./iterators 1000
    Normalized Average time = 0.0000000011943495282455513
    sumb=89259.51980374461

    $ ./mul_add 1000
    Normalized Average time = 0.0000000011861410852805122
    sumb=89259.52037960211

Maybe the performance gap is not due to what the author thought...
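For anyone unfamiliar with `mul_add`: it computes `self * a + b` with a single rounding at the end, whereas the plain expression rounds the product first and then the sum. That would explain why the two `sumb` values above differ in their last digits. A minimal sketch of the difference (the helper name `square_minus` is made up for illustration, it is not from the article):

```rust
/// Computes x*x - c two ways and returns (unfused, fused).
/// `square_minus` is a hypothetical helper, not code from the article.
fn square_minus(x: f64, c: f64) -> (f64, f64) {
    // Two roundings: x*x is rounded to f64, then the subtraction is rounded.
    let unfused = x * x - c;
    // One rounding: x*x - c is evaluated exactly, then rounded once.
    let fused = x.mul_add(x, -c);
    (unfused, fused)
}
```

With `x = 1 + 2^-30` and `c = 1 + 2^-29`, the exact product `x*x` is `1 + 2^-29 + 2^-60`. The `2^-60` term is lost when the product is rounded to an f64, so the unfused result is exactly zero, while the fused version keeps it.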

[1] https://doc.rust-lang.org/std/primitive.f64.html#method.mul_...


gpderetta 2471 days ago

Hum, did the program get vectorized?


lovasoa 2471 days ago

As I said, the compiler did generate FMA instructions. These are SIMD instructions, so yes, the program was vectorized.


tom_mellior 2471 days ago

> yes, the program was vectorized

"The program" contains two hot loops. Judging from the assembly code you linked in a sibling comment, only the second of these loops was vectorized, the first one wasn't. This slower non-vectorized loop will still dominate execution time.

And for whatever it's worth, dropping the original article's

    for i in 0..n {
        b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i])
    }

in place of your loop using iterators and FMA, you still get nicely vectorized code (though without FMA) for this loop:

    .LBB6_120:
        vmovupd zmm2, zmmword ptr [r12 + 8*rcx]
        vmovupd zmm3, zmmword ptr [r12 + 8*rcx + 64]
        vmovupd zmm4, zmmword ptr [r12 + 8*rcx + 128]
        vmovupd zmm5, zmmword ptr [r12 + 8*rcx + 192]
        vmovupd zmm6, zmmword ptr [r14 + 8*rcx]
        vmovupd zmm7, zmmword ptr [r14 + 8*rcx + 64]
        vmovupd zmm8, zmmword ptr [r14 + 8*rcx + 128]
        vmovupd zmm9, zmmword ptr [r14 + 8*rcx + 192]
        vmulpd  zmm10, zmm2, zmmword ptr [rbx + 8*rcx]
        vsubpd  zmm6, zmm6, zmm10
        vmulpd  zmm10, zmm3, zmmword ptr [rbx + 8*rcx + 64]
        vsubpd  zmm7, zmm7, zmm10
        ...

Neither FMA nor iterators make a difference for whether this loop is vectorized, so any speedups are necessarily limited.


silentvoice 2471 days ago

Hi, I wrote the linked blog post, and I feel a little silly that I didn't check that _both_ loops vectorized. So I fixed the Rust implementation to keep a running vector of partial sums, which I combine at the end. This one did vectorize. The result was a 2X performance bump, which I'm about to include in the blog post as an update.

If it's OK I'll link to this comment as the inspiration.

On the iterators versus loop: for some reason when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
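To make the comparison concrete, here is a rough sketch of the two styles for the update loop quoted above. The function names and signatures are my own guesses for illustration, not the article's code. Whether the indexed version vectorizes will depend on the compiler version and on whether it can prove `n` is within bounds for all three slices.

```rust
/// Indexed form, as quoted from the article. Each b[i], c[i], a[i] is
/// bounds-checked unless the compiler can prove i < len for every slice.
fn update_indexed(a: &[f64], b: &mut [f64], c: &[f64], r: f64, beta: f64, n: usize) {
    for i in 0..n {
        b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i]);
    }
}

/// Iterator form: zip stops at the shortest slice, so there is no index
/// to check inside the loop body.
fn update_zipped(a: &[f64], b: &mut [f64], c: &[f64], r: f64, beta: f64) {
    let scale = r / beta;
    for ((bi, &ai), &ci) in b.iter_mut().zip(a.iter()).zip(c.iter()) {
        *bi = *bi + scale * (ci - ai * *bi);
    }
}
```

Both perform the same floating-point operations in the same order, so they should give identical results; only the code the compiler generates around them differs.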


lovasoa 2470 days ago

I liked your trick of iterating over chunks to force the compiler to vectorize the code! Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster.

See the generated assembly here: https://rust.godbolt.org/z/G5A2u0
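For readers who haven't followed the links, the general shape of the chunks trick combined with `mul_add` is roughly the following. This is only a sketch: I'm assuming a dot-product-style reduction and a chunk width of 8, and `dot_chunked` is a made-up name, so check the godbolt link for the real code.

```rust
/// Hypothetical dot product with 8 independent partial sums.
/// The reduction and the chunk width are assumptions for illustration.
fn dot_chunked(x: &[f64], y: &[f64]) -> f64 {
    const LANES: usize = 8;
    // Trim both slices to a common length so the chunks line up.
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &y[..n]);

    let xc = x.chunks_exact(LANES);
    let yc = y.chunks_exact(LANES);
    // Elements left over when n is not a multiple of LANES.
    let (xr, yr) = (xc.remainder(), yc.remainder());

    // Each accumulator only depends on its own lane, so the inner loop
    // can map onto one SIMD register without reordering any additions.
    let mut acc = [0.0f64; LANES];
    for (cx, cy) in xc.zip(yc) {
        for k in 0..LANES {
            acc[k] = cx[k].mul_add(cy[k], acc[k]);
        }
    }

    // Combine the partial sums, then handle the tail sequentially.
    let mut sum: f64 = acc.iter().sum();
    for (&a, &b) in xr.iter().zip(yr.iter()) {
        sum = a.mul_add(b, sum);
    }
    sum
}
```

The point of the explicit accumulator array is that the source code itself spells out the reassociated sum, so the compiler doesn't need permission to reorder floating-point additions.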


silentvoice 2470 days ago

Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this because Rust is more conservative.

I've tried using mul_add, but at the moment performance isn't much better. I also noticed someone else running a big parallel build on my machine, so I'll wait a bit and then run the full sweep over the problem sizes with mul_add.

So it seems the presence of FMA didn't really have a performance implication, except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" also allows vectorization of reductions.
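The reason a reduction needs that permission is that floating-point addition isn't associative, so splitting a sum across lanes can change the answer. A small sketch of the effect (both function names are made up, and two lanes stand in for a real SIMD width):

```rust
/// Sums strictly left to right, in the order the source code spells out.
fn sum_sequential(xs: &[f64]) -> f64 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Sums even- and odd-indexed elements separately and adds the two
/// partial sums at the end, the kind of regrouping a vectorized
/// reduction performs.
fn sum_two_lanes(xs: &[f64]) -> f64 {
    let mut lanes = [0.0f64; 2];
    for (i, &x) in xs.iter().enumerate() {
        lanes[i % 2] += x;
    }
    lanes[0] + lanes[1]
}
```

On the input `[1e16, 1.0, -1e16, 1.0]` the sequential sum gives 1.0, because `1e16 + 1.0` rounds back to `1e16`, while the two-lane sum gives 2.0. Without fast math the compiler has to preserve the first behavior, which rules out this kind of regrouping unless you write it by hand.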


tom_mellior 2470 days ago

Great to hear that you managed another 2x speedup! Sure, feel free to link my comment if you like.


gameswithgo 2471 days ago

It isn't always that simple. FMA instructions are tricky to use in a way that actually improves performance; LLVM may get it right on its own where requesting them manually may not.

Also, sometimes a SIMD instruction is used but only on 1 lane at a time. This is actually common with floating point code.


jcl 2470 days ago

Something I found surprising: Some AVX2 and AVX-512 instructions consume so much power that Intel chose to have their chips dynamically slow their clock frequency when the instructions are executed. So naively switching to SIMD instructions can not only fail to improve performance, but it can also hurt the performance of unaltered code executed after it -- even unrelated code running on other cores.

https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


lovasoa 2471 days ago

What do you mean by "manually"? `mul_add` is a Rust function that operates on a single f64; it's still up to LLVM to choose which instructions to use and to do the vectorization.


gpderetta 2471 days ago

Well, technically FMA doesn't necessarily imply vectorization; it depends on whether the P{S,D}-suffixed or the S{S,D}-suffixed instructions were used. But if you saw the (P)acked variants, then yes, it was vectorized.


lovasoa 2471 days ago

You can see the full compiler output here:

https://rust.godbolt.org/z/FbDqye


gpderetta 2471 days ago

Thanks. It seems that at least parts of the inner loop have been vectorized. Edit: If I'm reading the asm correctly, the second zip has been vectorized, but the first fold has not.
