by latchkey 721 days ago
Elio clarified this to me. He said...

256 tokens in 1 prompt/batch

Not 256 batches

2 comments

Yes, a single sequence with 256 prompt tokens and 256 output tokens. That is batch size = 1. No one is saying anything about 256 batches.

The first step in understanding this is to notice that the model (llama2) generates 1 output token at a time. This is because llama2 70B is an autoregressive decoder-only transformer.

Fundamentally, to generate a single output token you need to read all of the model weights. Each forward pass generates 1 token.

OK, now to generate 256 output tokens you need 256 sequential forward passes. At each forward pass, the entire model is read from GPU VRAM.

Even at ideal memory bandwidth (5.3 TB/s), that (256 forward passes of a 128.48 GB model, about 32.9 TB of weight reads in total) should take at least 6.2 s.

The reported number of 1.63s should not be possible.
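
The arithmetic behind that bound, as a rough sketch. It assumes every forward pass streams the full 128.48 GB of weights from VRAM exactly once at the peak 5.3 TB/s, and it ignores the KV cache, activations and prompt processing (all of which would only add time). The function name is mine, for illustration only.

```python
def min_decode_seconds(model_gb: float, bandwidth_tb_s: float, output_tokens: int) -> float:
    """Lower bound on decode time at batch size 1.

    Assumption: each output token needs one full read of the weights
    from VRAM, and nothing else costs any time.
    """
    bytes_per_pass = model_gb * 1e9
    bytes_per_second = bandwidth_tb_s * 1e12
    return output_tokens * bytes_per_pass / bytes_per_second


lower_bound = min_decode_seconds(model_gb=128.48, bandwidth_tb_s=5.3, output_tokens=256)
# Bandwidth (TB/s) that would be needed to finish the same reads in 1.63 s.
implied_tb_s = 256 * 128.48 / 1.63 / 1000

print(f"lower bound: {lower_bound:.2f} s")
print(f"1.63 s would imply {implied_tb_s:.1f} TB/s")
```

Under those assumptions the bound comes out at roughly 6.2 s, and finishing in 1.63 s would need about 20 TB/s of effective bandwidth, nearly four times the peak.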

I'd strongly recommend checking for correctness, i.e. that the generated output is coherent. Try sending actual prompts to the "gemm tuned" model and observing the generated responses and latencies. With "benchmark_throughput.py" you only get a final number, and there is no check on whether the output is valid.
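
A minimal sketch of that kind of check. `generate_fn` is a hypothetical stand-in for whatever call sends one prompt to the model and returns the generated text; it is not a vLLM API. The distinct-word ratio is only a crude heuristic I'm assuming here to flag degenerate output. Actually reading the responses is the real test.

```python
import time
from typing import Callable


def check_prompts(generate_fn: Callable[[str], str], prompts: list[str]) -> list[dict]:
    """Send each prompt on its own (batch size 1); record latency and text."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        text = generate_fn(prompt)
        latency = time.perf_counter() - start
        words = text.split()
        # Crude heuristic (assumption): repeated-token garbage has few distinct words.
        distinct_ratio = len(set(words)) / len(words) if words else 0.0
        results.append({
            "prompt": prompt,
            "text": text,
            "latency_s": latency,
            "distinct_ratio": distinct_ratio,
        })
    return results


if __name__ == "__main__":
    # Stub standing in for the real model call, so the sketch runs on its own.
    def fake_generate(prompt: str) -> str:
        return "the the the the"

    for r in check_prompts(fake_generate, ["Explain what a GEMM is."]):
        print(f"{r['latency_s']:.3f}s  distinct={r['distinct_ratio']:.2f}  {r['text']!r}")
```

Swap the stub for the real call to the tuned model, then compare both the text and the per-request latency against the untuned model on the same prompts.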

I'm not sure which benchmark you mean here, but I'll just comment that the Chips and Cheese article (which Elio apparently worked on?) looks like it used a batch size of 128 or so. Chips and Cheese don't mention the batch size used, though, so it's hard to be 100% sure.