|
> yes, the program was vectorized

"The program" contains two hot loops. Judging from the assembly code you linked in a sibling comment, only the second of these loops was vectorized, the first one wasn't. This slower non-vectorized loop will still dominate execution time.

And for whatever it's worth, dropping the original article's

    for i in 0..n {
        b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i])
    }

in place of your loop using iterators and FMA, you still get nicely vectorized code (though without FMA) for this loop:

    .LBB6_120:
        vmovupd zmm2, zmmword ptr [r12 + 8*rcx]
        vmovupd zmm3, zmmword ptr [r12 + 8*rcx + 64]
        vmovupd zmm4, zmmword ptr [r12 + 8*rcx + 128]
        vmovupd zmm5, zmmword ptr [r12 + 8*rcx + 192]
        vmovupd zmm6, zmmword ptr [r14 + 8*rcx]
        vmovupd zmm7, zmmword ptr [r14 + 8*rcx + 64]
        vmovupd zmm8, zmmword ptr [r14 + 8*rcx + 128]
        vmovupd zmm9, zmmword ptr [r14 + 8*rcx + 192]
        vmulpd zmm10, zmm2, zmmword ptr [rbx + 8*rcx]
        vsubpd zmm6, zmm6, zmm10
        vmulpd zmm10, zmm3, zmmword ptr [rbx + 8*rcx + 64]
        vsubpd zmm7, zmm7, zmm10
        ...

Neither FMA nor iterators make a difference for whether this loop is vectorized, so any speedups are necessarily limited. |
If it's OK I'll link to this comment as the inspiration.
On iterators versus the raw loop: for some reason, when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that the bounds checks stay inside the loop body because the compiler can't prove that the indices are in range for every slice. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
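
To make the bounds-checking point concrete, here is a minimal sketch of three ways to write the same update. The function names and signatures are made up for illustration; only the update rule comes from the original article. Whether any of these actually vectorizes depends on the compiler version, optimization level, and target CPU flags, so treat it as something to inspect in the assembly rather than a guarantee.

```rust
// Hypothetical helpers for illustration; only the update rule is from the article.

/// Raw indexed loop. `i < b.len()` is known from the loop range, but nothing
/// here tells the compiler that `a` and `c` are at least as long as `b`,
/// so `a[i]` and `c[i]` may each keep a bounds check inside the loop.
pub fn update_indexed(a: &[f64], b: &mut [f64], c: &[f64], r: f64, beta: f64) {
    let n = b.len();
    for i in 0..n {
        b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i]);
    }
}

/// Same loop, but `a` and `c` are re-sliced to length `n` up front.
/// This panics once, before the loop, if either is too short, and gives
/// the optimizer the length facts it needs to drop the per-iteration checks.
pub fn update_resliced(a: &[f64], b: &mut [f64], c: &[f64], r: f64, beta: f64) {
    let n = b.len();
    let (a, c) = (&a[..n], &c[..n]);
    for i in 0..n {
        b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i]);
    }
}

/// Iterator version: `zip` stops at the shortest input, so there is no
/// index that could be out of range and no bounds check to eliminate.
pub fn update_zipped(a: &[f64], b: &mut [f64], c: &[f64], r: f64, beta: f64) {
    let k = r / beta;
    for ((bi, &ai), &ci) in b.iter_mut().zip(a).zip(c) {
        *bi = *bi + k * (ci - ai * *bi);
    }
}
```

One behavioral difference to keep in mind: if `a` or `c` is shorter than `b`, the first two versions panic, while the zipped version silently updates only the common prefix.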