by ladberg
698 days ago
Just to check here, if you're given something like the following PTX: wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?
But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!
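
For anyone wondering what the "manual FMAs" fallback in the question would amount to, here is a rough scalar sketch of the math behind an m64n256k16 tile with bf16 inputs and an f32 accumulator. It is not taken from any real toolchain: the function names, the row-major layouts, and the truncating bf16 helper are all assumptions made for illustration, and real tensor hardware may accumulate in a different order and round differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Tile shape taken from the instruction name m64n256k16.
constexpr int M = 64;
constexpr int N = 256;
constexpr int K = 16;

// bf16 is the upper 16 bits of an IEEE-754 binary32 value,
// so widening is a shift followed by a bit copy.
inline float bf16_to_float(std::uint16_t bits) {
    std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &widened, sizeof(out));
    return out;
}

// Hypothetical helper: truncating float -> bf16 conversion.
// Assumption: truncation is used only to keep the sketch short.
inline std::uint16_t float_to_bf16_trunc(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Scalar reference for D = A * B + D on one tile.
// Assumed layouts (not dictated by the PTX instruction):
//   a: M x K, row-major, bf16 bit patterns
//   b: K x N, row-major, bf16 bit patterns
//   d: M x N, row-major, f32 accumulator updated in place
void reference_mma_m64n256k16(const std::vector<std::uint16_t>& a,
                              const std::vector<std::uint16_t>& b,
                              std::vector<float>& d) {
    assert(a.size() == static_cast<std::size_t>(M) * K);
    assert(b.size() == static_cast<std::size_t>(K) * N);
    assert(d.size() == static_cast<std::size_t>(M) * N);

    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = d[m * N + n];
            // One fused multiply-add per k step: 64 * 256 * 16 FMAs per tile.
            for (int k = 0; k < K; ++k) {
                acc = std::fma(bf16_to_float(a[m * K + k]),
                               bf16_to_float(b[k * N + n]),
                               acc);
            }
            d[m * N + n] = acc;
        }
    }
}
```

That is 262,144 scalar FMAs for a single tile, which is why mapping onto the matrix builtins matters so much for performance.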