This is one of the best uses I've found for Singeli[0]. Here's how I implemented an AVX2 transpose kernel similar to the one in transpose_Vec256_kernel, generic over element type and vector/kernel size (I use unpack instructions rather than shuffle and blend, which I think is faster since it's just one instruction for each interaction):
The language is oriented towards compile-time array programming instead of managing a bunch of individual vectors. So you have runtime vec_select{} (docs at [1]), mirrored by compile-time select{}, and the indices generated by pairs{} can be used in either.
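To make the unpack idea concrete, here is a rough scalar model in Python. It is not the Singeli kernel and not AVX2 code. The names unpack_lo, unpack_hi and transpose_by_unpack are made up for illustration, and it assumes a full-width interleave on a square power-of-two block. Real AVX2 unpack instructions interleave within each 128-bit lane, so an actual 256-bit kernel needs extra handling for the lane crossing that this sketch skips.

```python
def unpack_lo(a, b):
    """Interleave the low halves of a and b: [a0, b0, a1, b1, ...]."""
    half = len(a) // 2
    return [x for pair in zip(a[:half], b[:half]) for x in pair]


def unpack_hi(a, b):
    """Interleave the high halves of a and b."""
    half = len(a) // 2
    return [x for pair in zip(a[half:], b[half:]) for x in pair]


def transpose_by_unpack(rows):
    """Transpose an n x n block (n a power of two) using only unpack steps.

    Hypothetical model: each list stands in for one vector register.
    Every round pairs row i with row i + n/2 and emits their lo and hi
    interleaves; after log2(n) rounds the rows hold the transpose.
    """
    n = len(rows)
    assert n > 0 and n & (n - 1) == 0, "size must be a power of two"
    assert all(len(r) == n for r in rows), "block must be square"
    rows = [list(r) for r in rows]
    for _ in range(n.bit_length() - 1):  # log2(n) rounds
        nxt = []
        for i in range(n // 2):
            a, b = rows[i], rows[i + n // 2]
            nxt.append(unpack_lo(a, b))  # one "instruction" per output row
            nxt.append(unpack_hi(a, b))
        rows = nxt
    return rows


matrix = [[4 * r + c for c in range(4)] for r in range(4)]
result = transpose_by_unpack(matrix)
```

Each round produces n output rows from n unpack operations, so the whole block takes n*log2(n) of them, with no separate blend step.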