by mikk14 974 days ago
Funnily enough, just today I launched a big job on LUMI that I have also started on another, smaller cluster with NVIDIA GPUs. Basically, I'm running Llama2-70B to do some zero-shot text classification. The NVIDIA setup uses 4 A100s, while on LUMI I could access 6 MI250Xs.

Unfortunately, it is not an apples-to-apples comparison: on the NVIDIA cluster I'm running a quantized 34B version via llama-cpp-python, while on LUMI I'm running the official non-quantized full 70B version via the transformers library.

Long story short, I'm getting 7.5x higher throughput on LUMI than on the NVIDIA cluster (which, with 6 cards versus 4, means each card is 5x faster on LUMI).
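
For anyone checking the per-card figure, here is a minimal sketch of the arithmetic. The function name `per_card_speedup` is made up for illustration; the only inputs are the numbers quoted above (7.5x overall, 6 MI250Xs, 4 A100s), and it assumes throughput divides evenly across cards.

```python
def per_card_speedup(total_speedup, cards_fast, cards_slow):
    """Per-card throughput ratio, given the overall throughput ratio.

    Assumes throughput is split evenly across the cards in each setup.
    """
    # (total_fast / cards_fast) / (total_slow / cards_slow)
    return total_speedup * cards_slow / cards_fast


# Numbers from the comment: 7.5x overall, 6 cards on LUMI, 4 on the other cluster.
ratio = per_card_speedup(7.5, cards_fast=6, cards_slow=4)
print(f"{ratio:.1f}x per card")
```

Since the two setups run different model sizes and quantization levels, this ratio describes the jobs as run, not the cards themselves.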

Edit: The AMD GPUs work fine because one can run PyTorch for ROCm via the pytorch-triton-rocm package.
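
A quick way to confirm that a PyTorch install is actually the ROCm build might look like the sketch below. ROCm builds expose AMD GPUs through the usual `torch.cuda` calls and set `torch.version.hip`; the helper name `describe_backend` is hypothetical, and the output depends entirely on the machine it runs on.

```python
def describe_backend():
    """Report which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        # A version string on ROCm builds, None on CUDA or CPU-only builds.
        "hip_version": getattr(torch.version, "hip", None),
        # On ROCm builds the torch.cuda API reports AMD GPUs.
        "gpu_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(describe_backend())
```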

1 comment

Thanks, that's great to know. Do you know how they would compare if you performed training instead of inference?
I haven't tried it, unfortunately, and I don't really have the data to make an educated guess. I have played a little with training and it seemed a bit slow, but the testing environment is not really representative of the speed of submitted jobs -- even my inference was pretty slow in the testing environment, but once submitted the runtimes were very different.