
GPUs are TERRIBLE at executing code with tons of branches.

Basically, GPUs execute instructions in lockstep groups of threads. Each group executes the same instruction at the same time. If there's a conditional, and only some of the threads in a group have a state that satisfies the condition, then the group is split and the two paths are executed serially rather than in parallel. The threads following the "true" path execute while the threads that need to take the "false" path sit idle. Once the "true" threads complete, they sit idle while the "false" threads execute. Only once both sets of threads complete do they reconverge and continue.
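To put rough numbers on that, here is a toy cost model in Python. It is only a sketch, not how any real GPU schedules work: the group size of 32 matches NVIDIA's warps, but `run_branch` and the 10-instruction path lengths are made up for illustration.

```python
WARP_SIZE = 32  # assumption: NVIDIA-style groups ("warps") of 32 threads


def run_branch(conditions, true_len, false_len):
    """Toy model of one lockstep group executing an if/else.

    conditions: one bool per thread in the group
    true_len / false_len: hypothetical instruction counts of each path
    Returns (instruction slots spent, lane-slots wasted by idle threads).
    """
    on_true = sum(1 for c in conditions if c)
    on_false = len(conditions) - on_true
    slots = 0
    idle_lane_slots = 0
    # Pass 1: "true" threads execute, "false" threads are masked off.
    if on_true:
        slots += true_len
        idle_lane_slots += true_len * on_false
    # Pass 2: "false" threads execute, "true" threads are masked off.
    if on_false:
        slots += false_len
        idle_lane_slots += false_len * on_true
    return slots, idle_lane_slots


divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # half true, half false
uniform = [True] * WARP_SIZE                        # everyone agrees

print(run_branch(divergent, 10, 10))  # (20, 320)
print(run_branch(uniform, 10, 10))    # (10, 0)
```

In this model the divergent group takes twice as long as the uniform one, and half of its lanes are idle at any given moment.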

They're designed this way because it greatly simplifies the hardware. You don't need huge branch predictors or out-of-order execution engines, and it allows you to create a processor with thousands of cores (the RTX 5090 has over 21,000 CUDA cores!) without needing thousands of instruction decoders, which would be necessary to allow each core to do its own thing.

There ARE ways to work around this. For example, it can sometimes be faster to compute BOTH sides of a branch and use the condition only to select which result to keep. Each thread then needs just a conditional assignment, so any stall lasts only an instruction or two.
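The shape of that transformation looks roughly like this. It's a sketch in Python purely to show the rewrite, so it says nothing about performance; `branchy`, `select`, and the arithmetic inside them are hypothetical. In real GPU code the final selection would typically compile down to a select or predicated instruction instead of a jump, though that depends on the compiler.

```python
def branchy(x):
    # Two separate code paths: threads in a group that disagree
    # on the condition would have to run them one after the other.
    if x < 0.0:
        return -x * 2.0
    else:
        return x + 1.0


def select(x):
    # Every thread computes both sides unconditionally...
    on_true = -x * 2.0
    on_false = x + 1.0
    # ...and the condition only picks which result to keep.
    return on_true if x < 0.0 else on_false


inputs = [-2.0, -0.5, 0.0, 1.5]
assert [branchy(x) for x in inputs] == [select(x) for x in inputs]
```

This only works when both sides are cheap and safe to evaluate for every thread: no side effects, and no out-of-bounds reads or divisions by zero on the side a thread wouldn't otherwise have taken.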

Of course, it's worth noting that this non-optimal behavior is only an issue with divergent branches. If every thread in a group takes the same path, whether "true" or "false", there's no performance penalty.
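One consequence is that the layout of the data matters as much as the branch itself. A small sketch, again assuming groups of 32 threads (`divergent_warps` is a made-up helper): the two inputs below have exactly the same number of "true" threads, but only one of them causes any divergence.

```python
WARP_SIZE = 32  # assumption: groups of 32 consecutive threads


def divergent_warps(conditions, warp_size=WARP_SIZE):
    """Count the groups whose threads disagree on the condition."""
    count = 0
    for start in range(0, len(conditions), warp_size):
        warp = conditions[start:start + warp_size]
        # Divergent means some threads are true and some are false.
        if any(warp) and not all(warp):
            count += 1
    return count


n = 128  # four groups of 32 threads
alternating = [i % 2 == 0 for i in range(n)]  # true/false interleaved
blocked = [i < 64 for i in range(n)]          # first half true, rest false

print(divergent_warps(alternating))  # 4
print(divergent_warps(blocked))      # 0
```

That's why sorting or bucketing work items so that neighbouring threads take the same path can help, when the data allows it.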