Funnily enough, just today I launched a big job on LUMI that I've also started on another, smaller cluster with NVIDIA GPUs. Basically, I'm running Llama2-70B to do some zero-shot text classification. The NVIDIA setup uses 4 A100s, while on LUMI I could access 6 MI250Xs.
Unfortunately, it's not an apples-to-apples comparison: on the NVIDIA cluster I'm running a quantized 34B version via llama-cpp-python, while on LUMI I'm running the official non-quantized full 70B version via the transformers library.
Long story short, I'm getting 7.5x higher throughput on LUMI than on the NVIDIA cluster (with 6 cards against 4, that means each card is 5x faster on LUMI).
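For anyone wondering where the 5x comes from, here is a quick sketch of the normalization. The function name `per_card_speedup` is just something made up for illustration; the numbers are the ones from this post, and the result inherits the same caveat about the two setups not being directly comparable.

```python
def per_card_speedup(total_speedup, cards_fast, cards_slow):
    """Normalize a whole-job throughput ratio by the card count on each side."""
    return total_speedup * cards_slow / cards_fast

lumi_cards = 6      # MI250X cards on LUMI (from the post)
nvidia_cards = 4    # A100 cards on the smaller cluster (from the post)
job_speedup = 7.5   # whole-job throughput ratio, LUMI vs. the NVIDIA cluster

ratio = per_card_speedup(job_speedup, lumi_cards, nvidia_cards)
print(f"per-card speedup: {ratio:.1f}x")
```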
Edit: The AMD GPUs work fine because one can run PyTorch for ROCm via the pytorch-triton-rocm package.
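If you want to check which backend your own PyTorch install targets, something like the following should work. It's a minimal sketch, not LUMI-specific: `describe_torch_backend` is a hypothetical helper name, and it relies on `torch.version.hip` being set on ROCm builds and on ROCm builds reusing the `torch.cuda` namespace.

```python
import importlib.util

def describe_torch_backend():
    """Return a short description of the GPU backend this PyTorch build targets."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    hip = getattr(torch.version, "hip", None)    # version string on ROCm builds, else None
    cuda = getattr(torch.version, "cuda", None)  # version string on CUDA builds, else None
    if hip:
        backend = f"ROCm/HIP {hip}"
    elif cuda:
        backend = f"CUDA {cuda}"
    else:
        backend = "CPU-only build"

    # ROCm builds expose AMD GPUs through torch.cuda, so the same calls apply.
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return f"{backend}, {n_gpus} visible GPU(s)"

summary = describe_torch_backend()
print(summary)
```

The practical upshot is that code written against `torch.cuda` or `device="cuda"` generally runs unchanged on a ROCm build.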
Unfortunately I haven't tried it, and I don't really have the data to make an educated guess. I've played around a little with some training and it seemed a bit slow, but the testing environment isn't really representative of the speed of submitted jobs. Even my inference was pretty slow in the testing environment, but once submitted the runtimes were very different.
You can run LLMs on your own machine. Do you think a supercomputer would have issues? CUDA has optimizations, but you don't necessarily need it to do inference at all.
Those supercomputers are extremely powerful. They might not be as energy efficient as H100s, but they do the job.
There has been a lot of recent progress in making PyTorch AMD-compatible, precisely because many government/university supercomputers are based on AMD GPUs.