But the scatter/gather instructions do random access memory operations. You have one SIMD register with a 8 (or whatever the width is) indexes to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.
That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.
But the scatter/gather instructions do random access memory operations. You have one SIMD register with a 8 (or whatever the width is) indexes to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.
That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.