It’s not about brute force, but about trying to do the exact same calculation on every thread. Efficiency in numpy or on a GPU comes from avoiding “divergence”, which is what it’s called on a GPU when some threads execute different instructions than other threads in the same thread group. If one thread executes a unique instruction, all the other threads in the group have to stall and wait for it. If all the threads execute unique blocks, the waiting becomes catastrophic and the result can be slower than single-threaded code. But if they all do the same thing, the machine will fly. Sometimes avoiding divergence means doing things that seem counter-intuitive compared to single-threaded CPU code, which is why it has a reputation for being brute force, but really it’s just a different set of efficiency tricks.
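
Here’s a rough numpy sketch of that idea. It’s only an analogy, since numpy on a CPU doesn’t have thread groups the way a GPU does, and the function names and the toy formula are made up for illustration. The first version branches per element; the second does the seemingly wasteful thing of computing both sides for every element and then selecting, so every element goes through the same instructions.

```python
import numpy as np

def shade_branchy(values):
    """Scalar style: each element picks its own branch."""
    out = []
    for v in values:
        if v < 0.5:
            out.append(v * v)
        else:
            out.append(1.0 - v)
    return out

def shade_uniform(values):
    """Data-parallel style: evaluate both sides everywhere, then select."""
    values = np.asarray(values, dtype=np.float64)
    dark = values * values    # computed for every element, even where unused
    light = 1.0 - values      # likewise
    return np.where(values < 0.5, dark, light)

samples = np.linspace(0.0, 1.0, 5)
print(shade_branchy(samples))
print(shade_uniform(samples))
```

Both versions give the same answer; the second one just does extra math to keep the control flow identical for every element.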

It is true that you don’t have to worry as much about repeating calculations. I think you’re referring to “rematerialization”, meaning after doing some non-trivial calculation once and using the result, throwing it away and redoing the same calculation again later on the same thread. It’s true this can sometimes be advantageous, mostly because memory use is so expensive. One load or store into VRAM can be as expensive as 10 or sometimes even 100 math instructions, so if your store & load takes 40 cycles, and recomputing something takes 25 cycles of math using registers, then recomputing can be faster.
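
To make the arithmetic concrete, here’s a tiny back-of-the-envelope sketch. The 40 and 25 are just the illustrative cycle counts from the paragraph above, not measurements from any real GPU, and the function name is hypothetical.

```python
def cheaper_to_recompute(store_load_cycles, recompute_cycles):
    """True when redoing the math costs fewer cycles than a store plus a later load."""
    return recompute_cycles < store_load_cycles

store_load_cycles = 40   # assumed cost of writing the value out and reading it back
recompute_cycles = 25    # assumed cost of redoing the math in registers

saved = store_load_cycles - recompute_cycles
print(cheaper_to_recompute(store_load_cycles, recompute_cycles), saved)
```

With those assumed numbers, recomputing comes out 15 cycles ahead. If the calculation were more expensive than the memory round trip, keeping the stored value would win instead.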

I second the sibling recommendation to learn numpy; it’s a different way of thinking than single-threaded functional programming with lists & maps. Try writing some kind of image filter in Python both ways, and get a feel for the performance difference. If you’re familiar with Python, this is a one or two hour exercise. Last time I tried it, my numpy version was ~2 orders of magnitude faster than the lists & maps version.
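
If you want a starting point for that exercise, here’s a minimal sketch using a 3x3 box blur. The choice of filter, the 200x200 random image, and the function names are all my own assumptions, and the list version uses plain loops rather than maps to keep it short. Timings will vary a lot by machine, so run it yourself rather than trusting anyone’s numbers.

```python
import time
import numpy as np

def blur_lists(img):
    """3x3 box blur on a list of lists; border pixels are left unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    total += img[y + dy][x + dx]
            out[y][x] = total / 9.0
    return out

def blur_numpy(img):
    """Same filter, written as nine shifted whole-array additions."""
    h, w = img.shape
    out = img.copy()
    acc = np.zeros((h - 2, w - 2), dtype=img.dtype)
    for dy in range(3):
        for dx in range(3):
            # Each slice is the interior region shifted by one neighbor offset.
            acc += img[dy:dy + h - 2, dx:dx + w - 2]
    out[1:-1, 1:-1] = acc / 9.0
    return out

rng = np.random.default_rng(0)
image = rng.random((200, 200))

t0 = time.perf_counter()
slow = blur_lists(image.tolist())
t1 = time.perf_counter()
fast = blur_numpy(image)
t2 = time.perf_counter()

print(f"lists: {t1 - t0:.4f}s  numpy: {t2 - t1:.4f}s")
```

The numpy version still has a Python loop, but it only runs nine times, and each iteration touches the whole image at once instead of one pixel.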

One of the most fun ways to learn SIMD programming, in my humble opinion, is to study the shaders on ShaderToy. ShaderToy makes it super simple to write GPU code and see the result. Some of the tricks people use are very clever, but after studying them for a while and trying a few yourself, you’ll start to see themes emerge about how to organize parallel image computations.
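
ShaderToy shaders are written in GLSL, but the basic shape (one small function of the pixel coordinate, evaluated for every pixel) can be sketched in numpy too. This is only a stand-in to show the structure, not real shader code; the pattern, the `render` name, and the parameters are invented for illustration.

```python
import numpy as np

def render(width, height, t=0.0):
    """Evaluate the same color formula at every pixel, like a fragment shader."""
    y, x = np.mgrid[0:height, 0:width]
    u = (x + 0.5) / width     # normalized horizontal coordinate in (0, 1)
    v = (y + 0.5) / height    # normalized vertical coordinate in (0, 1)
    d = np.hypot(u - 0.5, v - 0.5)          # distance from the image center
    r = 0.5 + 0.5 * np.cos(t + 10.0 * d)    # concentric rings in the red channel
    g = u                                   # horizontal gradient
    b = v                                   # vertical gradient
    return np.stack([r, g, b], axis=-1)     # shape (height, width, 3)

frame = render(64, 48)
print(frame.shape)
```

There are no loops over pixels and no per-pixel branches; everything is expressed as math on the coordinates, which is the habit a lot of ShaderToy code will teach you.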