by mikk14
974 days ago
Funnily enough, just today I launched a big job on LUMI that I had also started on another, smaller cluster with NVIDIA GPUs. Basically, I'm running Llama2-70B to do some zero-shot text classification. The NVIDIA setup uses 4 A100s, while on LUMI I could access 6 MI250Xs. Unfortunately, it is not an apples-to-apples comparison: on the NVIDIA cluster I'm running a quantized 34B version via llama-cpp-python, while on LUMI I'm running the official non-quantized full 70B version via the transformers library. Long story short, I'm getting 7.5x higher throughput on LUMI than on the NVIDIA cluster (which means each card is 5x faster on LUMI, since 7.5 × 4/6 = 5).

Edit: The AMD GPUs work fine because one can run PyTorch for ROCm via the pytorch-triton-rocm package.
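For anyone who wants to check the per-card figure, here is a minimal sketch of the normalization, assuming throughput scales linearly with the number of cards (which is itself a simplification). The function name `per_card_speedup` is just made up for illustration; the only inputs are the numbers quoted above.

```python
def per_card_speedup(total_speedup: float, cards_fast: int, cards_slow: int) -> float:
    """Convert a whole-job speedup into a per-card speedup.

    Assumes throughput scales linearly with card count on both clusters,
    so each side's throughput is simply divided by its number of cards.
    """
    return total_speedup * cards_slow / cards_fast


# 7.5x total throughput, 6 MI250X cards on LUMI vs 4 A100 cards
print(per_card_speedup(7.5, cards_fast=6, cards_slow=4))  # 5.0
```

Keep in mind this only normalizes for card count; it does not correct for the different model sizes, quantization, or inference libraries on the two clusters.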