
by Lindon4290 722 days ago

Right. On that track, I want to confirm something. Maybe I am doing my math wrong or don't understand how transformers work.

There is a video about the bs=1 case, i.e. a single prompt with 256 input tokens and 256 output tokens, running a Llama-2 70B model on a single MI300X (no tensor parallelism). The optimized result gives 314 t/s and completes the entire request in 1.63s.

Now, Llama-2 70B is an autoregressive decoder-only model. So, that means the entire set of model weights is read for every generated token.

The model weights are 128.48 GB (also shown in the video, and can be confirmed from HuggingFace). The card has 192 GB of HBM, so the model fits entirely on the card. The HBM memory bandwidth for this card is 5.3 TB/s.

Even assuming the 256-token prompt took no time, generating 256 tokens should take roughly 6s even at ideal memory bandwidth: (256 * 128.48e9) / (5.3e12) ≈ 6.2s.
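Here is the same back-of-the-envelope estimate as a quick Python sketch, in case someone wants to check my numbers. The function and variable names are mine, and it bakes in my assumptions: decimal units (1 GB = 1e9 bytes), exactly one full pass over the weights per generated token, batch size 1, and no KV-cache or activation traffic counted.

```python
def min_decode_seconds(weight_bytes, bandwidth_bytes_per_s, new_tokens):
    """Lower bound on decode time if every new token needs one full read
    of the weights from HBM (assumption: bs=1, no KV-cache traffic)."""
    return new_tokens * weight_bytes / bandwidth_bytes_per_s


# Figures quoted in this comment (decimal units assumed).
WEIGHT_BYTES = 128.48e9   # Llama-2 70B weights, 128.48 GB
HBM_BANDWIDTH = 5.3e12    # MI300X peak HBM bandwidth, 5.3 TB/s
NEW_TOKENS = 256          # generated tokens

# Minimum time to generate 256 tokens at peak bandwidth.
floor_s = min_decode_seconds(WEIGHT_BYTES, HBM_BANDWIDTH, NEW_TOKENS)

# Best-case decode rate under the same assumption.
ceiling_tps = HBM_BANDWIDTH / WEIGHT_BYTES

# Tokens implied by the reported 314 t/s over 1.63 s.
reported_tokens = 314 * 1.63

print(f"bandwidth floor for {NEW_TOKENS} tokens: {floor_s:.2f} s")
print(f"bandwidth ceiling: {ceiling_tps:.1f} t/s")
print(f"tokens implied by reported figures: {reported_tokens:.0f}")
```

That gives a floor of about 6.2 s and a ceiling of about 41 t/s, versus the reported 1.63 s and 314 t/s. The reported figures multiply out to about 512 tokens, i.e. input plus output, so the 314 t/s may be counting prompt tokens too (my guess, the video does not say), but 1.63 s is still well under the floor either way.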

What's going on here?