by ajross
2121 days ago
That's the way normal SIMD loads work, yeah. But the scatter/gather instructions do random-access memory operations. You have one SIMD register holding 8 (or whatever the width is) indices to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end. That has to hit the cache 8 times in the general case. It's extremely expensive for a single instruction, though still faster than running scalar code to do the same thing.
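To make the semantics concrete, here's a rough scalar model in C of what a gather does. It's only a sketch: `gather_model` and `LANES` are names made up for illustration, the 8-lane width is an assumption, and this is not how the hardware implements it. On x86 with AVX2 the real thing would be something like the `_mm256_i32gather_ps` intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumed vector width; real hardware may use 4, 8, 16, ... */

/* Scalar model of a gather (hypothetical helper, for illustration only):
 * out[lane] = base[idx[lane]] for every lane.
 * Each lane is an independent load, so in the general case the lanes
 * can land on up to LANES different cache lines. */
static void gather_model(const float *base, const int32_t idx[LANES],
                         float out[LANES])
{
    assert(base != NULL);
    for (size_t lane = 0; lane < LANES; lane++) {
        out[lane] = base[idx[lane]];  /* one memory access per lane */
    }
}
```

If the indices happen to be consecutive, all the loads fall in one or two cache lines and a plain vector load would do the job; the gather only earns its cost when the indices really are scattered.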