by chillee
639 days ago
To be clear, this performance is quite bad (presumably because you didn't manage to get compilation working). You're getting 35 tokens/s for a 405B model, which comes out to about 85 teraflops (using the usual estimate of 6 FLOPs per parameter per token). 8 MI300X GPUs come out to 10.4 petaflops at peak, so you're getting about 0.8% MFU (which is about 40-50x worse than decent training performance of 30-40% MFU). For AMD's sake, I hope that it's your software stack that's limiting perf.
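For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for training (forward plus backward), which is not stated in the comment but reproduces the 85 teraflops figure. The 10.4 petaflops peak is taken as quoted, and the function name `mfu` is just illustrative.

```python
# Back-of-the-envelope MFU check. All constants are approximations.
PARAMS = 405e9          # model size: 405B parameters
TOKENS_PER_SEC = 35     # reported training throughput
PEAK_FLOPS = 10.4e15    # combined peak of 8 GPUs, as quoted in the comment


def mfu(params, tokens_per_sec, peak_flops, flops_per_param_token=6):
    """Return (achieved FLOP/s, fraction of peak).

    flops_per_param_token=6 is an assumed rule of thumb for training
    (forward + backward pass), not a measured value.
    """
    achieved = flops_per_param_token * params * tokens_per_sec
    return achieved, achieved / peak_flops


achieved, utilization = mfu(PARAMS, TOKENS_PER_SEC, PEAK_FLOPS)
print(f"{achieved / 1e12:.0f} TFLOPS, {utilization:.2%} MFU")
# prints "85 TFLOPS, 0.82% MFU"
```

Against a 30-40% MFU baseline, that works out to a gap of roughly 37-49x, in line with the 40-50x estimate.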
|
Their GitHub page claims that it is possible to "tune LLaMa3.1 on Google Cloud TPUs for 30% lower cost", but they don't mention performance.