by ladberg 746 days ago

Just to check here, if you're given something like the following PTX:

  wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16

Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?

ckitching 745 days ago

Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.

But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!

saagarjha 745 days ago

What do you do when a builtin doesn't exist?

ckitching 745 days ago

Add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend if an instruction exists and the backend knows about it.

If there's no instruction either, you can write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (depends on the situation).
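
To make the "write a C++ function to replicate the behaviour" fallback concrete, here is a rough sketch of what a software-emulated matrix multiply-accumulate with bf16 inputs and f32 accumulation could look like. This is not the actual implementation: the names `bf16_to_f32` and `emulated_mma_bf16_f32` are made up for illustration, and the plain row-major tile layout is an assumption (real tensor instructions distribute fragments across threads and may round or order the accumulation differently).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen a bf16 bit pattern to float.
// bf16 is the upper 16 bits of an IEEE-754 binary32 value.
inline float bf16_to_f32(std::uint16_t bits) {
    std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &wide, sizeof out);
    return out;
}

// Hypothetical software fallback computing D = A * B + D.
// Assumed layout: A is MxK, B is KxN, D is MxN, all row-major.
// Small and inline so a compiler can fold it into the caller.
template <std::size_t M, std::size_t N, std::size_t K>
inline void emulated_mma_bf16_f32(const std::uint16_t* a,
                                  const std::uint16_t* b,
                                  float* d) {
    assert(a != nullptr && b != nullptr && d != nullptr);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = d[i * N + j];  // start from the existing accumulator
            for (std::size_t k = 0; k < K; ++k) {
                acc += bf16_to_f32(a[i * K + k]) * bf16_to_f32(b[k * N + j]);
            }
            d[i * N + j] = acc;
        }
    }
}
```

A call such as `emulated_mma_bf16_f32<2, 2, 2>(a, b, d)` accumulates a 2x2 tile in place. The loops are the "potentially suboptimal" part: they get the right answer in the simple cases, but none of the throughput of the tensor hardware.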