by ajross
2119 days ago
That gather needs to issue 8 independent loads. It's never going to be fast, and I think there's a strong argument that you don't even want to spend the transistors on all the extra load/store units required. The goal of scatter/gather instructions is that they should be demonstrably faster than assembling the values in scalar code, and beyond that... meh. If you're doing random access to memory like that, you're probably out of the realm of what is appropriate in vector code and should be looking at other hardware (cf. a GPU's texture units) to manage your memory access.
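To make the "8 independent loads" point concrete, here's a rough sketch in portable C of what an 8-lane gather has to do, next to a contiguous load for contrast. The names (`gather8_scalar`, `load8_contiguous`, `LANES`) and the 8 x 32-bit lane width are my own assumptions for illustration, not any particular ISA's instruction or intrinsic. This is the scalar baseline a hardware gather has to beat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* assumed vector width: 8 x 32-bit lanes (hypothetical) */

/* Hypothetical helper: the semantics of an 8-lane gather.
 * Each lane reads from an unrelated address, so in the worst case this is
 * 8 separate loads touching 8 different cache lines. */
static void gather8_scalar(const float *base, const int32_t idx[LANES],
                           float out[LANES])
{
    for (size_t lane = 0; lane < LANES; ++lane) {
        assert(idx[lane] >= 0);      /* debug check: no negative indices */
        out[lane] = base[idx[lane]]; /* one independent load per lane */
    }
}

/* For contrast: a contiguous 8-lane load. All lanes sit in 32 adjacent
 * bytes, which a vector unit can typically fetch as a single wide access. */
static void load8_contiguous(const float *base, size_t start,
                             float out[LANES])
{
    for (size_t lane = 0; lane < LANES; ++lane) {
        out[lane] = base[start + lane];
    }
}
```

Both functions fill the same 8 output lanes. The difference is purely in the addresses: the second one walks adjacent memory, while the first can land anywhere in the table, which is why the load ports (not the ALUs) end up as the bottleneck.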