|
|
|
|
|
by Lindon4290
721 days ago
|
|
Yes, a single sequence with 256 prompt tokens and 256 output tokens. This is a batch size = 1. No one is saying anything about 256 batches. The first step in understanding this is to notice that the model (llama2) generates 1 output token at a time. This is because the llama2 70B is a autoregressive decoder-only transformer. Fundamentally, to generate a single output token you need to process the entire model weights. At each forward pass you generate 1 token. OK, now to generate 256 output tokens - you need 256 sequential forward passes. At each forward pass, the entire model is read from the gpu VRAM. Even at ideal memory bandwidth (5.3 TB/s) that (256 forward passes of a 128.48GB model) should take 6s. The reported number of 1.63s should not be possible. I'd strongly recommend checking for correctness - that the generate output is coherent. Try sending actual prompts to the "gemm tuned" model and observing the generated responses and latencies. With the "benchmark_throughput.py" you only get a final number and there is no check whether the output is valid or not. |
|