
by dragontamer 1775 days ago

Agreed.

The fundamental flaw of SIMD that _SHOULD_ be discussed is branch divergence. Because of the way SIMD is designed, it's probably hopeless for branch divergence to ever be solved.

The wider the SIMD, the more branch divergence messes up your performance. The narrower the SIMD, the less it matters.
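To put a rough number on that, here is a toy model (my own sketch, not taken from any vendor documentation). It assumes each lane takes the branch independently with the same probability, which real workloads rarely satisfy since neighboring lanes tend to be correlated. The function name `divergence_probability` and the 5% figure are made up for illustration.

```python
def divergence_probability(width: int, p_taken: float) -> float:
    """Chance that a group of `width` lanes disagrees on a branch.

    ASSUMPTION: every lane takes the branch independently with
    probability p_taken. Real branches are usually correlated across
    neighboring lanes, so treat this as a rough model only.
    """
    all_taken = p_taken ** width            # every lane takes the branch
    none_taken = (1.0 - p_taken) ** width   # no lane takes the branch
    return 1.0 - all_taken - none_taken     # anything else is a mixed group


if __name__ == "__main__":
    # A branch taken by 5% of lanes, at a few common SIMD widths.
    for width in (4, 8, 32, 64):
        print(width, round(divergence_probability(width, 0.05), 3))
```

Under this model the chance of a mixed group only grows with width, which matches the intuition: a rare branch that a 4-wide unit mostly skips is hit by nearly every 64-wide group.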

CPUs suffer a form of branch-divergence slowdown when branches are hard to predict (CPUs try to execute the future branch in parallel with the current code). So I guess branch divergence affects all code. But... GPUs are especially harmed by branch divergence, even more so than any CPU would be.

---------

This is different from the "fixed width" SIMD that is discussed in the blogpost. Any chosen width will have branch divergence.

GPUs don't really have a fixed width though. Through the magic of thread barrier commands, you can have anything from the native wavefront / warp width (32 on NVidia), all the way to 1024-wide thread groups.

The advantage of very-wide groups is that 1024-at-a-time is sometimes easier to think about than 64-at-a-time. You really should just choose the width that makes the most sense for your problem. Ex: a 32x32 group of pixels is handled 1024-wide, while an 8x8 group of pixels is handled 64-wide.

jjoonathan 1775 days ago

There's also the matter of branch divergence costing not only compute time, but programmer time, if the programming model is bad.

Using SIMD primitives that force me to pack my own vectors and handle all the divergence edge cases manually makes me want to stab my eyes out. Trying to get "CPU-style" auto-vectorization engines to infer vector semantics from a fully scalar program makes me want to stab my eyes out. Using "GPU-style" (NVidia calls it SIMT) auto-vectorization, which infers vector semantics by sweeping a kernel input parameter, is a breath of fresh air.
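For anyone who hasn't seen the "sweep a kernel input parameter" style, here is a minimal sketch of the programming model in plain Python. `launch` and `branchy_axpy` are hypothetical names of mine, not any real GPU API, and the loop runs sequentially; on real hardware the runtime would map indices onto SIMD lanes and handle the masking.

```python
def launch(kernel, n, *args):
    """Hypothetical stand-in for a GPU launch: call kernel once per index.

    A real SIMT runtime would run groups of indices in lockstep on SIMD
    lanes; this sequential loop only models the programming interface.
    """
    for i in range(n):
        kernel(i, *args)


def branchy_axpy(i, a, x, y, out):
    # Written as scalar code for ONE element; i is the swept parameter.
    if x[i] >= 0.0:
        # Divergence is just an ordinary `if` from the programmer's view.
        out[i] = a * x[i] + y[i]
    else:
        out[i] = y[i]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    out = [0.0] * len(x)
    launch(branchy_axpy, len(x), 2.0, x, y, out)
    print(out)
```

The point is that the kernel never mentions vectors, masks, or packing: the divergence cost is still there, but it is the runtime's problem rather than mine.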

I get that hardware people want to focus on the hardware, not the programming interface, but the amount of good hardware that sank for want of a good programming interface is truly mind-blowing. Normally I wouldn't have expected 90% of an industry to repeatedly shoot itself in the foot for decades, but from an outsider's perspective that seems to be exactly what happened.

dragontamer 1775 days ago

I wish C++AMP actually got the investment and attention it deserved.

Microsoft was about 5 years too early on that one. No one understood its relevance when it first came out.

mycall 1775 days ago

I just read SYCL is inspired by C++AMP.

https://en.wikipedia.org/wiki/SYCL

wahern 1775 days ago

> it's probably hopeless for branch divergence to ever be solved

Here's a partial solution, at least: Huihui Sun, Florian Fey, Jie Zhao, Sergei Gorlatch, "WCCV: improving the vectorization of IF-statements with warp-coherent conditions", 2019, https://www.di.ens.fr/~zhaojie/ics2019.pdf

> WCCV uses two different methods to detect warp-coherent conditions. The first method detects boolean-step conditions based on static affine analysis. Affine analysis is usually used for analyzing memory access patterns [15, 23], while we use affine analysis to analyze the variables and expressions used in conditions in order to detect a boolean-step condition. If the static affine analysis fails, we use the second method based on the branch probability estimation to identify high-probability conditions. We develop a cost model based on the estimated branch probability and branch cost: if a certain branch is more likely to be executed and the corresponding branch cost is greater than a threshold, we treat the corresponding condition as warp-coherent. We use auto-tuning to determine the optimal thresholds for various target platforms and applications.

Specifically, the detection of "boolean-step conditions" looks interesting. The fallback heuristic sounds like one of those things that doesn't extrapolate well in the wider software ecosystem.

lostmsu 1775 days ago

I like DirectX ray tracing's take on branch divergence: when a program processing a vector of rays has a branch, the vector is split into 2 subvectors by the taken branch. Then each group is rescheduled separately. The process is repeated, so the SIMD unit can be fully utilized until one of the subgroups becomes too small.

I often wonder if this approach can be used for general programming. For example, I could imagine a C parser that you feed 200 files. It would use a SIMD unit of width 20 to separate them into 100 files that start with a function declaration and 100 files that start with a variable declaration. Then the second group would be scheduled to descend (as in recursive descent) and be split 20/80 into int/bool variables, and so on, all with the SIMD unit fully utilized.

On the same note, I am curious whether it would be easy enough to write a code generator that uses this pattern when translating regular programs.
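A back-of-the-envelope model of the split-and-reschedule idea, using the 200 files / width 20 numbers from above. This is only my sketch: it counts SIMD passes for a single branch level, ignores the cost of the regrouping itself, and the function names are made up.

```python
import math


def passes_masked(outcomes, width):
    """Passes when lanes stay in their original batches.

    A batch whose lanes disagree must run both sides of the branch,
    each time with part of the lanes masked off.
    """
    passes = 0
    for start in range(0, len(outcomes), width):
        batch = outcomes[start:start + width]
        passes += len(set(batch))  # 1 if coherent, 2 if mixed
    return passes


def passes_regrouped(outcomes, width):
    """Passes when items are split by branch outcome and repacked.

    ASSUMPTION: regrouping is free, which real hardware does not give you.
    """
    taken = sum(1 for o in outcomes if o)
    not_taken = len(outcomes) - taken
    return math.ceil(taken / width) + math.ceil(not_taken / width)


if __name__ == "__main__":
    # Worst case for masking: outcomes alternate, so every batch is mixed.
    outcomes = [i % 2 == 0 for i in range(200)]
    print(passes_masked(outcomes, 20), passes_regrouped(outcomes, 20))
```

With alternating outcomes every masked batch pays for both sides, while the regrouped version runs only full batches. Whether that win survives the real cost of moving data between groups is exactly the open question.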
