Develop against CUDA locally. Port my kernels to ROCm, and occupy a whole HPC node for debugging and performance tuning for a week. It’s terrible.
Edit: I should say that their recommendation is to write the kernels in HIP, which is supposed to be their cross-device wrapper for both CUDA and ROCm. I’m writing in Julia, however, so that’s not possible.
The AMD software stack has been behind for a long time, but I feel like we're finally catching up. I heard that HIP (and hopefully the rest of ROCm) is now supported on the RX 6800 XT consumer GPU... maybe that could help? BTW my team at AMD has been using Julia for ML workloads for a while. We should get in touch - maybe some of the lessons we've learned can be useful to you. My email is claforte. The domain I'm sure you can guess. ;-)
BTW have you tried `KernelAbstractions.jl`? With it you can write code once that will run reasonably fast on AMD or NVIDIA GPUs or even on CPU. One of our engineers just started using it and is pleased with it - apparently the performance is nearly equivalent to native CUDA.jl or AMDGPU.jl, and the code is simpler.
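For anyone curious what that looks like, here is a minimal sketch of a SAXPY kernel, assuming the KernelAbstractions.jl 0.9-style API (`get_backend`, `synchronize`); older releases used a different device/event interface. The kernel name `saxpy!` and the array sizes are just made up for illustration. It runs on the CPU backend as written; in principle, swapping the plain `Array`s for `CuArray` or `ROCArray` should be the only change needed to target a GPU, though I haven't benchmarked this particular snippet.

```julia
using KernelAbstractions

# Hypothetical example kernel: y[i] = a * x[i] + y[i]
@kernel function saxpy!(y, a, @Const(x))
    i = @index(Global)              # global linear index of this work-item
    @inbounds y[i] = a * x[i] + y[i]
end

# Plain Arrays select the CPU backend; GPU array types would select theirs.
x = ones(Float32, 1024)
y = fill(2f0, 1024)

backend = KernelAbstractions.get_backend(y)
kernel! = saxpy!(backend, 64)       # 64 is an arbitrary workgroup size
kernel!(y, 2f0, x; ndrange = length(y))
KernelAbstractions.synchronize(backend)  # wait for the launch to finish
```

The kernel body never mentions a vendor; the backend is picked from the array type at launch time, which is what makes the same source portable.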