by anthonix1
698 days ago
I just tried it with llm.c ... seems to be missing quite a few key components such as cublaslt, bfloat16 support, nvtx3, and compiler flags such as -t. And it's linked against an old release of ROCm. So it's unclear to me how it is supposed to be an improvement over something like hipify.
It appears we implemented `--threads` but not `-t` for the compiler flag. Oops. Either way, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.
That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.
NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available.
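For anyone wondering what the "no-ops to make the code compile" level of support amounts to, here's a rough sketch. The `nvtx_stub` namespace and its function names are made up for illustration; this is not the real NVTX API and not what SCALE ships, just the general shape of a shim that accepts annotation calls and discards them.

```cpp
#include <cassert>

// Hypothetical no-op shim: names are illustrative, not the NVTX API.
namespace nvtx_stub {
// Accept a range name and discard it; no profiler is listening.
inline int range_push(const char* /*message*/) { return 0; }
// Close the most recent range; nothing to do.
inline int range_pop() { return 0; }
// Point-in-time marker; also discarded.
inline void mark(const char* /*message*/) {}
}  // namespace nvtx_stub

// Instrumented code compiles and behaves the same, minus any profiling data.
inline int sum_with_ranges(const int* data, int n) {
    nvtx_stub::range_push("sum");
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    nvtx_stub::range_pop();
    return total;
}
```

Annotated code builds and runs unchanged, but nothing shows up in a profiler. Getting past that point is the part that needs the profiler authors on board.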
bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementation just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.
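To make "does the math in `float`" concrete, here's a minimal host-side sketch of that kind of emulation. The `Bf16` type and the function names are hypothetical, and this is not HIP's actual code; it only assumes the standard bfloat16 layout, i.e. the top 16 bits of an IEEE-754 binary32 value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical storage type for illustration only.
struct Bf16 { std::uint16_t bits; };

// Widen: place the 16 stored bits in the top half of a binary32 value.
inline float bf16_to_float(Bf16 h) {
    std::uint32_t u = static_cast<std::uint32_t>(h.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Narrow with round-to-nearest-even; NaN inputs stay NaN.
inline Bf16 float_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        // Force a mantissa bit so truncation cannot turn NaN into infinity.
        return Bf16{static_cast<std::uint16_t>((u >> 16) | 0x0040u)};
    }
    u += 0x7fffu + ((u >> 16) & 1u);  // rounding bias, ties go to even
    return Bf16{static_cast<std::uint16_t>(u >> 16)};
}

// "bfloat16 addition": widen both operands, add in float, narrow the result.
inline Bf16 bf16_add(Bf16 a, Bf16 b) {
    return float_to_bf16(bf16_to_float(a) + bf16_to_float(b));
}
```

Every operation pays for two widenings and a rounding step around an ordinary `float` add, so the only saving is in storage, not in arithmetic throughput.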
cublasLt is a fair cop. We've got a ticket :D.