by dotnet00
825 days ago
The tooling around ROCm (debuggers, profilers, etc.) is not as good, and at least in my tangential experience (GPGPU computation, but not for ML), custom operations are faster when written directly in CUDA than in a high-level Python wrapper (or, for that matter, with tools like OpenMP). Just as we write all our truly performance-demanding code in C/C++, we write all our performance-sensitive GPU code in CUDA (and obviously, performance is the entire point of putting in the effort to write GPU code).