|
|
|
|
|
by ckitching
698 days ago
|
|
Hi! Spectral engineer here! SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm block to LLVM IR. To put in a less compiler-engineer-ey way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++. Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation). Differing consistency _between architectures_ is the AMDGPU backend's problem. |
|
I'm curious how something like this example would translate:
===
Mapping lower-level ptx patterns to higher-level AMD constructs like __ballot, and knowing it's safe
```
```===
Again, I'm guessing there might be an equiv simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for safely doing so.
===
FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, etc libs), minimal cuda, and minimal opencl, with only small traces of inline ptx. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable.