by chillee 639 days ago
To be clear, this performance is quite bad (presumably because you didn't manage to get compilation working).

You're getting 35 tokens/s for a 405B model, which comes out to about 85 teraflops (assuming the usual ~6 FLOPs per parameter per token for training: 6 × 405B × 35). 8 MI300X GPUs come out to 10.4 petaflops, so you're getting about 0.8% MFU (which is about 40-50x worse than decent training performance of 30-40% MFU).
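For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for training (forward plus backward), and it takes the 10.4 petaflops aggregate peak figure from the comment as given. The function names are made up for illustration, not taken from any library.

```python
# Back-of-the-envelope MFU (model FLOPs utilization) estimate.
# Assumption: training costs ~6 FLOPs per parameter per token.

def achieved_flops(params: float, tokens_per_s: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Approximate FLOP/s actually spent on the model."""
    return flops_per_param_token * params * tokens_per_s


def mfu(achieved: float, peak: float) -> float:
    """Fraction of peak hardware FLOP/s that the model is using."""
    return achieved / peak


PARAMS = 405e9         # 405B-parameter model
TOKENS_PER_S = 35.0    # reported throughput
PEAK_FLOPS = 10.4e15   # 8 GPUs, aggregate peak quoted in the comment

achieved = achieved_flops(PARAMS, TOKENS_PER_S)
utilization = mfu(achieved, PEAK_FLOPS)

print(f"achieved: {achieved / 1e12:.1f} TFLOP/s")
print(f"MFU: {utilization:.2%}")
print(f"gap vs 30% MFU: {0.30 / utilization:.0f}x")
print(f"gap vs 40% MFU: {0.40 / utilization:.0f}x")
```

With these inputs the gap to 30-40% MFU works out to roughly 37-49x, which is where the "40-50x" figure comes from.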

For AMD's sake, I hope that it's your software stack that's limiting perf.

1 comment

That's exactly what I wanted to ask:

Their GitHub page claims that it is possible to "tune LLaMa3.1 on Google Cloud TPUs for 30% lower cost", but they don't mention performance.