Is ROCm actually usable in this year's machine learning ecosystem? Can I just drop in any PyTorch model that was developed on CUDA and expect it to work?
> Is ROCm actually usable in this year's machine learning ecosystem?
I don't know, as I'm only just now building out my first AMD-based ML machine to run ROCm. All I can really say is that AMD seems to be making a genuine effort to get ROCm to that level. See the two links I submitted yesterday[1][2] for more details.
The two things in particular that stand out to me from all this are:
1. They are at least publicly declaring their intention to make ROCm a player in AI/ML. Previously there was at least a perception (and quite possibly a reality) that ROCm was more focused on other HPC workloads and not really on AI/ML. AMD seems committed to changing that.
2. It seems that they are finally serious about getting ROCm working on their consumer Radeon cards. Even though 5.6 didn't include the long hoped-for announcement of such support, the blog post they put out did at least officially declare their intent to do so in a release this fall. And maybe more to the point, the batch of changes in 5.6 did actually include some fixes for problems encountered running on Radeon cards, even though they aren't yet officially listed as supported.
On my projects it kind of works, once you get through the usual driver installation work. Depending on the card, you may have to set environment variables like HSA_OVERRIDE_GFX_VERSION to keep it from crashing.
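For anyone who wants to try the override from Python rather than from the shell, here is a rough sketch. It assumes the variable has to be in the environment before PyTorch initializes the GPU runtime, so it is set before `torch` is imported. The value `10.3.0` and the helper names `apply_gfx_override` and `describe_backend` are illustrative assumptions, not something from the comment above; the right value depends on your card.

```python
import os

# Assumption: "10.3.0" is only an illustrative value; the correct one depends on the card.
GFX_OVERRIDE = "10.3.0"


def apply_gfx_override(version, environ=os.environ):
    """Set HSA_OVERRIDE_GFX_VERSION unless one is already present; return the value in effect."""
    environ.setdefault("HSA_OVERRIDE_GFX_VERSION", version)
    return environ["HSA_OVERRIDE_GFX_VERSION"]


def describe_backend():
    """Report which GPU backend the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip}" if hip else "non-ROCm build"
    # ROCm builds expose the GPU through the same torch.cuda API that CUDA builds use.
    if not torch.cuda.is_available():
        return f"{backend}, no usable GPU detected"
    return f"{backend}, device 0: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    apply_gfx_override(GFX_OVERRIDE)  # must happen before torch is imported
    print(describe_backend())
```

Because `setdefault` is used, a value you already exported in your shell wins over the one hard-coded in the script.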
On an MI250+ system, or other similar architectures that mirror what El Capitan is going to look like, ROCm is stable and there are PyTorch and CuPy backends for it. It mostly just works. If you have custom kernels as part of your pipeline, you'd need to convert them from CUDA to HIP though.
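To give a feel for what the CUDA-to-HIP conversion involves, here is a toy sketch of the renaming part. For much of the runtime API it is close to a mechanical rename. The mapping below is a small hand-picked subset, and `toy_hipify` is a hypothetical helper written for this example; for real code you would use the `hipify-perl` or `hipify-clang` tools that come with ROCm, which cover far more than this.

```python
import re

# Small, hand-picked subset of CUDA runtime names and their HIP counterparts.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match only whole identifiers, so names that merely contain a CUDA name are left alone.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in CUDA_TO_HIP)
    + r")(?![A-Za-z0-9_])"
)


def toy_hipify(source):
    """Rename the known CUDA runtime identifiers in `source` to their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)


if __name__ == "__main__":
    cuda_snippet = (
        "#include <cuda_runtime.h>\n"
        "cudaMalloc(&d_buf, n);\n"
        "cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);\n"
        "cudaFree(d_buf);\n"
    )
    print(toy_hipify(cuda_snippet))
```

This only touches host-side API names. Anything that relies on CUDA-only libraries or inline PTX needs real porting work, not a rename.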
If you're looking for something on AMD consumer cards... then you'll have to keep waiting.
[1]: https://news.ycombinator.com/item?id=36522683
[2]: https://news.ycombinator.com/item?id=36522876