| Now, I don't have an MI300X, so I can't make any definite claims here. I am hoping someone else can replicate the results shown here, or at the least educate me on how this is possible. The good part is that the docker container and associated steps are public, which is pretty cool! Going by the video, the first thing that gave me pause was that a single MI300X is pulling off Groq-like performance, i.e. 314 tokens/second at batch size 1 (bs=1) with a 256-token prompt and 256 generated tokens. [1] Llama-2 70B is 128.48GB in FP16 (you can see this in the video). The entire model fits well within the 192GB of HBM on the MI300X, which is awesome! However, for an autoregressive transformer model, all of the model weights are processed to generate each next token during generation. These models are "next token predictors", so to speak, and you need the previous token to generate the next one. Therefore, the 128.48GB of model weights must be read from HBM into the compute cores of the MI300X for every generated token. Note, I am not talking about the prefill, which only needs a single forward pass to generate the first output token. Every subsequent output token is autoregressive. The video shows a single prompt (bs=1) with a 256-token prompt and 256 tokens generated within 1.63 seconds. There is no tensor parallelism involved, no batching, nothing else. This is a bs=1 case on a single card, so you can reason about the math fairly easily. This shouldn't fundamentally be possible within the specs of the MI300X card. The card has a peak HBM bandwidth of 5.3 TB/s. To cycle through the weights (assuming FP16) 256 times, you'd need a minimum of about 6.2 seconds, even under ideal conditions. Napkin math: (256 * 128.48e9) / (5.3e12) ≈ 6.2 seconds. [1] https://wow.groq.com/groq-sets-new-large-language-model-perf... |
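For anyone who wants to check the napkin math, here is a minimal Python sketch. It uses only the figures quoted in the comment above (128.48GB of FP16 weights, 5.3 TB/s peak HBM bandwidth, 256 generated tokens, 1.63 seconds). The function names are made up for illustration, and it assumes every weight byte is read exactly once per generated token, with no caching, batching, or quantization. The second calculation is deliberately generous: it treats the whole 1.63 seconds as decode time and ignores prefill.

```python
# Figures quoted in the comment; nothing here is measured.
MODEL_BYTES = 128.48e9      # Llama-2 70B weights in FP16, as shown in the video
HBM_BANDWIDTH = 5.3e12      # MI300X peak HBM bandwidth, bytes per second
GENERATED_TOKENS = 256      # output tokens at bs=1
CLAIMED_SECONDS = 1.63      # end-to-end time shown in the video


def min_decode_seconds(model_bytes, bandwidth, tokens):
    """Lower bound on decode time if all weights are streamed once per token."""
    return tokens * model_bytes / bandwidth


def required_bandwidth(model_bytes, tokens, seconds):
    """Bandwidth needed to stream all weights once per token within `seconds`."""
    return tokens * model_bytes / seconds


floor_s = min_decode_seconds(MODEL_BYTES, HBM_BANDWIDTH, GENERATED_TOKENS)
needed_bw = required_bandwidth(MODEL_BYTES, GENERATED_TOKENS, CLAIMED_SECONDS)

print(f"bandwidth-bound floor: {floor_s:.2f} s")
print(f"bandwidth needed for {CLAIMED_SECONDS} s: {needed_bw / 1e12:.1f} TB/s")
print(f"ratio to peak HBM bandwidth: {needed_bw / HBM_BANDWIDTH:.1f}x")
```

If the assumptions hold, the claimed time would need several times the card's peak bandwidth, which is the gap the comment is asking about.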
256 tokens in 1 prompt/batch, not 256 batches.