|
|
|
|
|
by hutzlibu
1114 days ago
|
|
GPU programming in general is definitely not trivial, as I can confirm with struggeling to learn WebGPU right now. But it really depends on the problem, simple math operations on lots of data is usually indeed trivially faster. Like AI mostly is with math on matrices. Or for example I just implemented a simple 2D raycastsolver in wgsl and as a first project it is totally not optimized - but even on my old laptop with crappy integrated GPU, but (relativly) fast CPU - I can now do 10000 raycasts per frame easily, while the cpu (wasm!) struggles with 500. The raw power of the gpu is really awesome. But every step is hard and debugging a mess. Which is why only a handful of people seems do be doing it. But now would probably be a good time, to get into it.. as I think gpu compute just has started and will get big. |
|
* Radix sort is your friend. Fun fact, O(n log n) is not the fastest a sort can run, it's the fastest a comparison-based sort can run. Radix sort runs in O(N) time, and in fact parallelizes extremely well on GPUs. Extremely. They are great at it. And there are in-place radix sorts too, just a bit slower (same asymptotic performance tho).
* "Insert an element into this collection" style steps can be replaced by a sort and a prefix-sum operation. If you know the offset of the first element with key K, and you know the offset of the first element with key J, you know the offset and size of that "collection" for K within a flat array ("size(K) = offset(J) - offset(K)"). Both of these run super fast in parallel and if you can tweak your problem around to be some kind of sorting operation that usually produces good speedups like this. Easiest way to get a speedup from everything I've heard, if your program works fast from a lookup table it's a good approach.
* Recomputing (or duplicate computation) is often much faster than storing intermediate results. "Procedural generation" is interesting because you can re-compute the generation step on demand. Random123 is also very nice compared to a (CuRand) mersenne twister/etc - why are you, a believer in the cryptographic maxim that hash(key, msg) is uncorrelated to hash(key, msg+1), still storing RNG state? Being able to play back arbitrary parts of a datastream at will is incredibly powerful, you can fastforward and rewind through the data previously used to interact with an item, as long as you know the epoch of the interaction you want for a particular key. And because computation is cheaper than memory, and memory bandwidth - it's really actually practically free in program time terms to just do some math. This is a form of data compression and performance enhancement.
* Generally you must understand the idea of divergence and memory coalescing/alignment to keep those lanes executing. And it is highly preferable to use sorts and prefix scans and butterfly operations (reduction, etc) even within a warp, because traditional "mutex/atomic" paradigms don't work well with 100k threads. But this is just the programming idioms of this particular platform, I am sure LISP is similar too in terms of "oh that's how you do that" once you're accustomed.
("This thread would like to write 3 items to an output array, can I get an alignment into a datablock the warp group is going to allocate from a global counter via atomic_add and write to" is another common task element in many parallel workloads. That is the same prefix-sum idiom, or a butterfly reduce, just at a warp level. And then when all the warps have finished, you can sort this unsorted output by a key array to align it, and start processing it again...)
* Cuda Unbound (CUB) provides reference implementations of all of these operations, and it handles portability across CUDA generations. Having good primitives makes a big difference, most of the workload isn't business logic, it's library calls, so using a library that just does it right gives you a lot of bang for buck. Don't write it yourself.
* Texture maps aren't just for graphics, they are a black box that lets the GPU perform 2D and 3D coalescing and some interpolation.
* Intelligent use of constants memory is another one, probably as is the use of CPU memory. If a value will be seldom accessed, you can probably stuff it into host memory and just accept the slowdown. Or you can store only epochs on the GPU and recompute intermediate values as needed. Try to ensure that all threads in a warp will do it too (host access vs recomputing).
* Raytracing is of course impervious to all of this (so don't worry too much that you can't magically hammer a speedup out of it, nobody really can). You can accelerate the raycasting and intersection testing (and AMD and NVIDIA and Intel all do this differently) but as a general matter rays are completely random and uncoalesced and divergent, they bypass almost all of the ideal GPGPU "hot paths". Ray sorting/shader execution reordering is something that needs hardware assistance, and Intel and NVIDIA both have hardware along these lines. The idea of Intel of making a facility for async future/promise dispatch for sparse tasks (and then sorting the shaders to get good coalescing/etc) is really neat and they've said it's going to come to GPGPU. https://youtu.be/SA1yvWs3lHU?t=289
* You can, however, use your rays more efficiently. And that's an area of active focus for everyone. And I think more efficient use of TAAU samples is probably where raster is going too.