Hacker News new | ask | show | jobs
by Veliladon 1365 days ago
Because you can do things using bitmasks and single instructions instead of brute forcing using multiple instructions.

Let's say you have a whole heap of 8-bit numbers you want to multiply by 2 and you have a set of 256-bit registers and a nice SIMD multiply command. If you don't have a shuffle you need to assemble your series of 2s for the second operand for each lane before you can even start. This is going to take hundreds of instructions and hundreds of clocks. Shuffle means you load up lane 0 with the "2" and then splat the contents of lane 0 across the other 31 lanes in two instructions and a few clocks using the shuffle unit.

N.B. Shuffle isn't just about splatting. There's a whole heap of different operations it can do that are useful. I just picked a simple example with an obvious massive performance increase for illustrative purposes.

2 comments

You're not talking about shuffle, you're talking about broadcast. Shuffle instructions is where you take one or two vectors, and output a third with elements from any index of the input. So for example `out = [in[2], in[1]]` is a shuffle of a vector of length 2.

It's useful for example if you have say RGB color data stored contiguously in memory as say RGBRGBRGBRGB..., and you want to vectorize operations on R, B and G separately. You can load a few registers like [RGBR][GBRG][BRGB], and then shuffle them to [RRRR][BBBB][GGGG]. In fact it's not entirely trivial how to shuffle optimally, it takes a few shuffles to get there.

More generally, if you have an array of structs, you often need to go to struct of arrays to do vectorized operations on the array, before returning to an array of struct again.

Another example is fast matrix transpose (in fact you can think of the RGB example a 3 by N matrix transpose to N by 3, where N is the vector width -- AoS -> SoA is a transpose too, in a sense). Suppose you have a matrix of size N by N where N is the vector width, you need N lg N shuffles to transpose the matrix.

I think that example is too simple to show the benefit of shuffle. It's like explaining the benefit of an adder by showing how you can move a value with X = Y + 0. Especially since there's also a (much simpler) piece of hardware dedicated to ultra-fast splat/broadcast (under the right conditions).