| Agreed. The fundamental flaw of SIMD that _SHOULD_ be discussed is branch divergence. Because of the way SIMD is designed, it's probably hopeless for branch divergence to ever be solved. The wider the SIMD, the more branch divergence messes up your performance. The narrower the SIMD, the less it matters. CPUs have their own form of branch-divergence slowdown when branches are hard to predict (CPUs speculatively execute the predicted branch in parallel with the current code). So I guess branch divergence affects all code. But... GPUs are especially harmed by branch divergence, even more so than any CPU would be. --------- This is different from the "fixed width" SIMD that is discussed in the blogpost. Any chosen width will have branch divergence. GPUs don't really have a fixed width though. Through the magic of thread barrier commands, you can have anything from the native wavefront / warp width (32 on NVidia), all the way to 1024-wide thread groups. The advantage of very wide groups is that 1024-at-a-time is sometimes easier to think about than 64-at-a-time. You really should just choose the width that makes the most sense for your problem. Ex: a 32x32 block of pixels is handled 1024-wide, while an 8x8 block of pixels is handled 64-wide. |
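To put a rough number on the "wider is worse" point in the quote, here is a toy model in Python. It is only a sketch under loud assumptions: each item takes the branch independently with probability `p_taken`, a group costs one pass if every lane agrees and two passes (both sides, masked) if any lane disagrees, and both sides cost the same. Real workloads usually have correlated branches, so real hardware tends to do better than this. The function names are made up for illustration.

```python
import random

def simulated_cost(width, p_taken, n_items=1 << 16, seed=0):
    """Average passes per group, measured by simulation.

    Toy model (an assumption, not a real GPU): a group of `width` lanes
    costs 1 pass if all lanes agree on the branch, 2 passes otherwise.
    """
    rng = random.Random(seed)
    groups = n_items // width
    passes = 0
    for _ in range(groups):
        taken = [rng.random() < p_taken for _ in range(width)]
        passes += 1 if all(taken) or not any(taken) else 2
    return passes / groups

def expected_cost(width, p_taken):
    """Closed form for the same toy model.

    A group is uniform only if all lanes take the branch or none do,
    so P(diverge) = 1 - p^W - (1 - p)^W.
    """
    p_uniform = p_taken ** width + (1 - p_taken) ** width
    return 1 + (1 - p_uniform)

# Print the modeled cost for a branch taken by 5% of items.
for width in (1, 4, 8, 32, 64):
    print(width, round(expected_cost(width, 0.05), 3))
```

In this model a width of 1 never pays for divergence, and the cost climbs toward 2 passes as the width grows, even though only a small fraction of items take the branch.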
Using SIMD primitives that force me to pack my own vectors and handle all the divergence edge cases manually makes me want to stab my eyes out. Trying to get "CPU-style" auto-vectorization engines to infer vector semantics from a fully scalar program makes me want to stab my eyes out. Using "GPU-style" (NVidia calls it SIMT) auto-vectorization, which infers vector semantics by sweeping a kernel input parameter, is a breath of fresh air.
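For anyone who hasn't felt the difference, here is a rough sketch of the two styles in plain Python (no real SIMD or GPU involved, and every name here is hypothetical). `clamp_scale_packed` mimics the manual style, where I pack vectors, build the mask, compute both sides of the branch, and deal with the tail myself. `clamp_scale_kernel` mimics the SIMT style: scalar code for one index, with `launch` standing in for the runtime that sweeps the index parameter.

```python
WIDTH = 4  # hypothetical vector width, chosen only for illustration

def clamp_scale_packed(xs):
    """Manual-SIMD style: the caller packs, masks, blends, and handles the tail."""
    out = []
    for base in range(0, len(xs), WIDTH):
        vec = xs[base:base + WIDTH]
        live = len(vec)                      # the last vector may be partial
        vec = vec + [0.0] * (WIDTH - live)   # pad it to full width
        mask = [v < 0.0 for v in vec]        # per-lane branch condition
        then_side = [0.0] * WIDTH            # both sides are computed for all lanes
        else_side = [v * 2.0 for v in vec]
        blended = [t if m else e for m, t, e in zip(mask, then_side, else_side)]
        out.extend(blended[:live])           # drop the padding lanes
    return out

def clamp_scale_kernel(i, xs, out):
    """SIMT style: scalar code for one index; the branch is just an `if`."""
    if xs[i] < 0.0:
        out[i] = 0.0
    else:
        out[i] = xs[i] * 2.0

def launch(kernel, n, *args):
    """Stand-in for a GPU launch: sweeps the index parameter over 0..n-1."""
    for i in range(n):
        kernel(i, *args)

xs = [-1.0, 2.0, 3.0, -4.0, 5.0]
out = [None] * len(xs)
launch(clamp_scale_kernel, len(xs), xs, out)
print(out == clamp_scale_packed(xs))
```

Both versions compute the same thing. The difference is who has to write the masking and tail handling: me, or the compiler and hardware.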
I get that hardware people want to focus on the hardware, not the programming interface, but the amount of good hardware that sank for want of a good programming interface is truly mind-blowing. Normally I wouldn't have expected 90% of an industry to repeatedly shoot itself in the foot for decades, but from an outsider's perspective that seems to be exactly what happened.