You're not talking about shuffle, you're talking about broadcast. A shuffle instruction takes one or two vectors and outputs a third whose elements can come from any index of the inputs. So for example `out = [in[2], in[1]]` (with 1-based indices) is a shuffle of a vector of length 2.
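To make the distinction concrete, here is a minimal sketch that emulates a two-input shuffle on plain Python lists. The `shuffle` helper and its 0-based index convention are my own assumptions for illustration, not any real instruction set's API; real hardware usually takes the indices as an immediate or a separate index register.

```python
def shuffle(a, b, idx):
    """Emulate a two-input shuffle (hypothetical helper, not a real ISA call).

    Each output lane i takes element idx[i] from the concatenation a + b,
    so indices 0..n-1 select from a and n..2n-1 select from b.
    """
    assert len(a) == len(b) == len(idx)
    src = list(a) + list(b)
    return [src[i] for i in idx]


v = [10, 20]
w = [30, 40]

# 0-based form of out = [in[2], in[1]]: swap the two lanes of one vector.
swapped = shuffle(v, v, [1, 0])

# Lanes can be drawn from both inputs in any order.
mixed = shuffle(v, w, [3, 0])

# A broadcast is the degenerate case where every index is the same.
broadcast = shuffle(v, v, [0, 0])
```

The point is that a shuffle picks an arbitrary index per output lane, while a broadcast only ever repeats one lane.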

It's useful, for example, if you have RGB color data stored contiguously in memory as RGBRGBRGBRGB... and you want to vectorize operations on R, G and B separately. You can load a few registers like [RGBR][GBRG][BRGB], and then shuffle them to [RRRR][GGGG][BBBB]. It's not entirely trivial to do this optimally; it takes a few shuffles to get there.
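As a rough sketch of the RGB case, the following emulates 4-lane registers with Python lists and deinterleaves three of them using six two-input shuffles. The `shuffle` helper is hypothetical, and this particular sequence is just one that works; I'm not claiming it's the optimal one, and real instruction sets restrict which index patterns are available.

```python
def shuffle(a, b, idx):
    """Hypothetical two-input shuffle: lane i of the result is (a + b)[idx[i]]."""
    src = list(a) + list(b)
    return [src[i] for i in idx]


# Four interleaved pixels, labelled so the result is easy to read.
pixels = [f"{c}{i}" for i in range(4) for c in "RGB"]

# Three 4-lane "registers": [R G B R] [G B R G] [B R G B]
v0, v1, v2 = pixels[0:4], pixels[4:8], pixels[8:12]

# Each channel is spread over all three registers, so it takes two
# shuffles per channel: gather from v0 and v1, then merge in v2.
# Lanes marked "don't care" are overwritten by the second shuffle.
r = shuffle(shuffle(v0, v1, [0, 3, 6, 0]), v2, [0, 1, 2, 5])  # last lane of inner: don't care
g = shuffle(shuffle(v0, v1, [1, 4, 7, 0]), v2, [0, 1, 2, 6])  # last lane of inner: don't care
b = shuffle(shuffle(v0, v1, [2, 5, 0, 0]), v2, [0, 1, 4, 7])  # last two lanes of inner: don't care
```

After this, `r`, `g` and `b` each hold one channel for all four pixels, which is the layout you want for per-channel vector arithmetic.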

More generally, if you have an array of structs, you often need to convert it to a struct of arrays to do vectorized operations on it, and then convert back to an array of structs afterwards.

Another example is fast matrix transpose (in fact, you can think of the RGB example as a 3 by N matrix transpose to N by 3, where N is the vector width; AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N, where N is the vector width: you need N lg N shuffles to transpose the matrix, that is, lg N rounds of N two-input shuffles each.
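For N = 4 that works out to 4 * lg 4 = 8 shuffles. Here's a sketch of the usual two-round scheme, again emulated on Python lists with a hypothetical `shuffle` helper; the function name `transpose4x4` and the index patterns are my choices for illustration, modelled on the interleave-style shuffles most SIMD instruction sets offer.

```python
def shuffle(a, b, idx):
    """Hypothetical two-input shuffle: lane i of the result is (a + b)[idx[i]]."""
    src = list(a) + list(b)
    return [src[i] for i in idx]


def transpose4x4(rows):
    """Transpose a 4x4 matrix held as four 4-lane rows using 8 shuffles."""
    r0, r1, r2, r3 = rows

    # Round 1: interleave neighbouring rows element by element.
    t0 = shuffle(r0, r1, [0, 4, 1, 5])  # a00 a10 a01 a11
    t1 = shuffle(r0, r1, [2, 6, 3, 7])  # a02 a12 a03 a13
    t2 = shuffle(r2, r3, [0, 4, 1, 5])  # a20 a30 a21 a31
    t3 = shuffle(r2, r3, [2, 6, 3, 7])  # a22 a32 a23 a33

    # Round 2: combine 2-element halves to form the columns.
    c0 = shuffle(t0, t2, [0, 1, 4, 5])  # a00 a10 a20 a30
    c1 = shuffle(t0, t2, [2, 3, 6, 7])  # a01 a11 a21 a31
    c2 = shuffle(t1, t3, [0, 1, 4, 5])  # a02 a12 a22 a32
    c3 = shuffle(t1, t3, [2, 3, 6, 7])  # a03 a13 a23 a33
    return [c0, c1, c2, c3]


matrix = [[4 * row + col for col in range(4)] for row in range(4)]
transposed = transpose4x4(matrix)
```

Each round doubles the size of the blocks that are already in the right place, which is where the lg N factor comes from.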