by Lindon4290 721 days ago
Now, I don't have an MI300X, so I can't make any definite claims here. I am hoping someone else can replicate the results shown here, or at least educate me on how this is possible. The good part is that the docker container and associated steps are public - which is pretty cool!

Going by the video, the first thing that gave me pause was that a single MI300X is pulling off Groq-like performance, i.e. 314 tokens/second at batch size = 1 (bs=1) with a prompt of 256 tokens and a generation of 256 tokens. [1]

Llama-2 70B is 128.48GB in FP16 (you can see this in the video). The entire model fits well within the 192GB of HBM on the MI300X - which is awesome! However, for an autoregressive transformer model, the entire set of model weights is processed during generation to produce a single next token. These models are "next token predictors", so to say, and you need the previous token to generate the next token. Therefore, the 128.48GB of model weights needs to be read from HBM into the compute cores of the MI300X for every generated token. Note, I am not talking about the prefill - which only needs a single forward pass to generate the first output token. Every subsequent output token is auto-regressive.

The video shows a single prompt (bs=1) with a 256 token prompt and 256 generated tokens completing within 1.63 seconds. There is no tensor parallelism involved, no batching, nothing else. This is a bs=1 case on a single card, and hence you can reason about the math fairly easily.

This shouldn't fundamentally be possible within the specs of the MI300X card. The card has a peak HBM memory bandwidth of 5.3 TB/s. You'll notice that to cycle through the weights (assuming FP16) 256 times, you'd need a minimum of about 6 seconds, even under ideal conditions. Napkin math: (256 * 128.48e9) / (5.3e12) ≈ 6.2 seconds
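The same napkin math as a small Python sketch. The function and variable names are mine, and it assumes every generated token needs one full read of the FP16 weights from HBM (no quantization, no speculative decoding, no weight reuse across tokens).

```python
def min_decode_seconds(model_bytes, output_tokens, hbm_bytes_per_s):
    """Lower bound on bs=1 decode time if each token re-reads all weights once."""
    return output_tokens * model_bytes / hbm_bytes_per_s

MODEL_BYTES = 128.48e9   # Llama-2 70B in FP16, size as shown in the video
HBM_BW = 5.3e12          # MI300X peak HBM bandwidth in bytes/s
OUTPUT_TOKENS = 256      # generated tokens in the benchmark

t = min_decode_seconds(MODEL_BYTES, OUTPUT_TOKENS, HBM_BW)
print(f"minimum decode time: {t:.2f} s")  # prints 6.21 s
```

Any real run will be slower than this bound, since peak bandwidth is never fully achieved and KV-cache reads are ignored here.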

[1] https://wow.groq.com/groq-sets-new-large-language-model-perf...

1 comments

Elio clarified this to me, he said...

256 tokens in 1 prompt/batch

Not 256 batches

Yes, a single sequence with 256 prompt tokens and 256 output tokens. This is batch size = 1. No one is saying anything about 256 batches.

The first step in understanding this is to notice that the model (llama2) generates 1 output token at a time. This is because llama2 70B is an autoregressive decoder-only transformer.

Fundamentally, to generate a single output token you need to process the entire model weights. At each forward pass you generate 1 token.

OK, now to generate 256 output tokens - you need 256 sequential forward passes. At each forward pass, the entire model is read from the GPU VRAM.

Even at ideal memory bandwidth (5.3 TB/s), that (256 forward passes of a 128.48GB model) should take at least about 6s.

The reported number of 1.63s should not be possible.
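Put another way, here is a rough Python sketch of what the reported time would imply. The variable names are mine, and it charitably assumes the whole 1.63s went to decode (the prefill also fits in that window, which only makes the gap larger).

```python
MODEL_BYTES = 128.48e9     # Llama-2 70B in FP16
PEAK_HBM_BW = 5.3e12       # MI300X peak HBM bandwidth in bytes/s
OUTPUT_TOKENS = 256
REPORTED_SECONDS = 1.63    # end-to-end time shown in the video

# Bandwidth needed to stream the weights once per output token in that time.
implied_bw = OUTPUT_TOKENS * MODEL_BYTES / REPORTED_SECONDS
ratio = implied_bw / PEAK_HBM_BW

# Best-case bs=1 decode rate if bandwidth is the only limit.
max_tokens_per_s = PEAK_HBM_BW / MODEL_BYTES

print(f"implied bandwidth: {implied_bw / 1e12:.2f} TB/s")    # 20.18 TB/s
print(f"ratio to peak: {ratio:.2f}x")                        # 3.81x
print(f"max bs=1 decode rate: {max_tokens_per_s:.2f} tok/s") # 41.25 tok/s
```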

I'd strongly recommend checking for correctness - that the generated output is coherent. Try sending actual prompts to the "gemm tuned" model and observing the generated responses and latencies. With the "benchmark_throughput.py" you only get a final number and there is no check whether the output is valid or not.
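A minimal sketch of that kind of check in Python. Everything here is hypothetical: `generate` stands in for whatever call you use to send a prompt to the served model and get text back (it is not an API from the benchmark script), and `min_plausible_seconds` is whatever lower bound you trust, e.g. the bandwidth bound.

```python
import time

def timed_generate(generate, prompt):
    """Call generate(prompt) and return (text, elapsed_seconds)."""
    start = time.perf_counter()
    text = generate(prompt)
    return text, time.perf_counter() - start

def check_prompts(generate, prompts, min_plausible_seconds):
    """Print every response for manual reading and flag runs that beat the bound."""
    results = []
    for prompt in prompts:
        text, elapsed = timed_generate(generate, prompt)
        too_fast = elapsed < min_plausible_seconds
        results.append((prompt, text, elapsed, too_fast))
        flag = " (faster than the bound!)" if too_fast else ""
        print(f"--- {prompt!r}: {elapsed:.2f} s{flag}")
        print(text)
    return results
```

The point is to read the printed text yourself: a run that is both faster than the bound and produces garbage would suggest the tuned kernels are not computing the full model.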

I'm not sure which benchmark you mean here, but I'll just comment that the Chips and Cheese article (which Elio apparently worked on?) looks like it used a batch size of 128 or so. Chips and Cheese don't mention the batch size used though, so it's hard to be 100% sure.