Yes, a single sequence with 256 prompt tokens and 256 output tokens. That is batch size = 1. No one is saying anything about 256 batches.

The first step in understanding this is to notice that the model (llama2) generates 1 output token at a time. This is because llama2 70B is an autoregressive decoder-only transformer.

Fundamentally, to generate a single output token you need to read the entire set of model weights. Each forward pass generates 1 token.

OK, now to generate 256 output tokens you need 256 sequential forward passes. At each forward pass, the entire model is read from GPU VRAM.

Even at ideal memory bandwidth (5.3 TB/s), 256 forward passes over a 128.48 GB model means reading about 32.9 TB of weights, which should take about 6.2 s.

The reported number of 1.63s should not be possible.
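For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration. It assumes decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes), treats 5.3 TB/s as the total bandwidth available, and counts only weight reads during decode. Prefill, KV cache reads, and activations are ignored, and those can only add time.

```python
# Back-of-the-envelope lower bound for memory-bound decoding.
# Assumptions (not from any benchmark script): decimal units, weights are
# read once per forward pass, and nothing else touches memory.

def min_decode_seconds(weight_gb: float, output_tokens: int, bandwidth_tb_s: float) -> float:
    """Minimum time to stream the weights once per generated token."""
    total_bytes = weight_gb * 1e9 * output_tokens
    return total_bytes / (bandwidth_tb_s * 1e12)


def implied_bandwidth_tb_s(weight_gb: float, output_tokens: int, seconds: float) -> float:
    """Bandwidth that would be needed to finish in the given time."""
    total_bytes = weight_gb * 1e9 * output_tokens
    return total_bytes / seconds / 1e12


# Numbers quoted in this thread.
lower_bound = min_decode_seconds(128.48, 256, 5.3)
implied = implied_bandwidth_tb_s(128.48, 256, 1.63)

print(f"lower bound at 5.3 TB/s: {lower_bound:.2f} s")
print(f"bandwidth implied by 1.63 s: {implied:.1f} TB/s")
```

Under those assumptions, finishing in 1.63s would need roughly 20 TB/s of sustained reads, several times the quoted peak.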

I'd strongly recommend checking for correctness, i.e. that the generated output is coherent. Try sending actual prompts to the "gemm tuned" model and observing the generated responses and latencies. With "benchmark_throughput.py" you only get a final number, and there is no check on whether the output is valid.