by ckitching 745 days ago

Hi! Spectral engineer here!

SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm blocks to LLVM IR.

To put it in a less compiler-engineer-ey way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.

Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).

Differing consistency _between architectures_ is the AMDGPU backend's problem.

2 comments

lmeyerov 745 days ago

Ah I was reading the 'deeper dive' section on my phone and missed it was a comparison, not a warning, thank you

I'm curious how something like this example would translate:

===

Mapping lower-level ptx patterns to higher-level AMD constructs like __ballot, and knowing it's safe

```

  #ifdef INLINEPTX
  inline uint ptx_thread_vote(float rSq, float rCritSq) {
      uint result = 0;
      asm("{\n\t"
           ".reg .pred cond, out;\n\t"
           "setp.ge.f32 cond, %1, %2;\n\t"
           "vote.sync.all.pred out, cond, 0xffffffff;\n\t"
           "selp.u32 %0, 1, 0, out;\n\t"
           "}\n\t"
           : "=r"(result)
           : "f"(rSq), "f"(rCritSq));
      return result;
  }
  #endif

```

===

Again, I'm guessing there might be an equiv simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for safely doing so.

===

FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, etc libs), minimal cuda, and minimal opencl, with only small traces of inline ptx. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable.

ckitching 744 days ago

I compiled your function with SCALE for gfx1030:

        .p2align        2                               ; -- Begin function _Z15ptx_thread_voteff
        .type   _Z15ptx_thread_voteff,@function
  _Z15ptx_thread_voteff:                  ; @_Z15ptx_thread_voteff
  ; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        v_cmp_ge_f32_e32 vcc_lo, v0, v1
        s_cmp_eq_u32 vcc_lo, -1
        s_cselect_b32 s4, -1, 0
        v_cndmask_b32_e64 v0, 0, 1, s4
        s_setpc_b64 s[30:31]
  .Lfunc_end1:
        .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                                        ; -- End function

What were the safety concerns you had? This code seems to be something like `return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0`, right?
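
To make the claimed equivalence concrete, here is a minimal host-side C++ sketch of what the PTX block computes across one warp. It is a model, not device code and not SCALE output: the names `model_thread_vote` and `kWarpSize` are hypothetical, and it assumes all 32 lanes are active, matching the `0xffffffff` member mask in the PTX.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // assumption: full 32-lane warp, as the 0xffffffff mask implies

// Host-side model of the inline PTX: each lane evaluates rSq >= rCritSq
// (setp.ge.f32), the warp checks whether the predicate held on every lane
// (vote.sync.all.pred), and every lane then receives 1 or 0 (selp.u32).
inline std::array<std::uint32_t, kWarpSize> model_thread_vote(
    const std::array<float, kWarpSize>& rSq,
    const std::array<float, kWarpSize>& rCritSq) {
    bool all = true;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        all = all && (rSq[lane] >= rCritSq[lane]);
    }
    std::array<std::uint32_t, kWarpSize> result;
    result.fill(all ? 1u : 0u);  // the vote result is uniform across the warp
    return result;
}

// Usage: one lane failing the comparison flips the result for every lane.
inline void model_thread_vote_example() {
    std::array<float, kWarpSize> r;
    std::array<float, kWarpSize> c;
    r.fill(2.0f);
    c.fill(1.0f);
    assert(model_thread_vote(r, c)[0] == 1u);
    r[7] = 0.5f;
    assert(model_thread_vote(r, c)[0] == 0u);
}
```

The loop over lanes plays the role that comparing the lane mask in `vcc_lo` against `-1` plays in the gfx1030 listing above: the result is 1 only when every lane's comparison bit is set.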

lmeyerov 744 days ago

It's supposed to be waiting for all threads to vote

I'm not familiar with AMD enough to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD commands.

ckitching 744 days ago

Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!

lmeyerov 744 days ago

Yeah I think, after this snippet was written, cuda added __all_sync as an intrinsic. The divergent code before this was plain-ish cuda, and this snippet ensures they wait on the comparison vote before recurring.

So in the AMD version, the compiler correctly realized the synchronization was on the comparison, so adds the AMD version right before it. That seems like a straightforward transform here.

It'd be interesting to understand the comparison of what Nvidia primitives map vs what doesn't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, I'd expect most of our PTX to port for similar reasons. The story is a bit diff for libraries we call. E.g., cudf probably has little compute-tier ptx directly, but will call nvidia libs, and use weird IO bits like cufile / gpu direct storage.

ladberg 745 days ago

Just to check here, if you're given something like the following PTX:

  wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16

Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?

ckitching 745 days ago

Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.

But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!

saagarjha 745 days ago

What do you do when a builtin doesn't exist?

ckitching 744 days ago

Add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend if an instruction exists and the backend knows about it.

If there's no instruction, either, you can write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (depends on the situation).
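
As an illustration of the "write a C++ function to replicate the behaviour" route, here is a minimal sketch of a software model for one PTX instruction, `bfe.u32` (unsigned bit-field extract). The choice of instruction is purely illustrative: the thread does not say which instructions SCALE emulates in software, and the name `emulated_bfe_u32` is hypothetical. The semantics follow my reading of the PTX ISA description, so treat the edge-case handling as an assumption to check against the spec.

```cpp
#include <cassert>
#include <cstdint>

// Software model of PTX `bfe.u32 d, a, b, c`.
// Assumed semantics: only the low 8 bits of the position (b) and length (c)
// operands are used, and bits selected beyond bit 31 read as zero.
inline std::uint32_t emulated_bfe_u32(std::uint32_t a, std::uint32_t b,
                                      std::uint32_t c) {
    const std::uint32_t pos = b & 0xffu;
    const std::uint32_t len = c & 0xffu;
    if (len == 0 || pos > 31) {
        return 0;  // empty field, or field starts past the top bit
    }
    const std::uint32_t shifted = a >> pos;  // pos <= 31, so the shift is well defined
    if (len >= 32) {
        return shifted;  // the mask would cover every remaining bit
    }
    return shifted & ((1u << len) - 1u);  // keep the low `len` bits
}

// Usage: extract the second-lowest byte of a word.
inline void emulated_bfe_u32_example() {
    assert(emulated_bfe_u32(0xABCD1234u, 8, 8) == 0x12u);
}
```

Because a helper like this is ordinary C++, a call to it can be inlined and optimised along with the surrounding code, which is the property the comment above relies on.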
