by Const-me 582 days ago
> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions in the same batch, each model tensor element loaded from memory is reused to compute one token for every session in the batch. With large enough batches, you can even saturate the GPU's compute throughput instead of bottlenecking on memory bandwidth.
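A rough back-of-the-envelope sketch of that arithmetic, in Python. It only uses the two numbers from the quote and rests on assumptions that are mine, not the parent's: the 969 tok/s is aggregate throughput across the batch, the weights are read from memory once per forward pass, and KV cache and activation traffic are ignored. The function names are made up for illustration.

```python
# Figures taken from the quoted claim.
TOKENS_PER_SECOND = 969            # assumed to be aggregate across the batch
BANDWIDTH_AT_BATCH_1_TBPS = 392.0  # TB/s the quote says is needed at int8


def weight_traffic_per_pass_gb() -> float:
    """Bytes read per forward pass (in GB) implied by the quoted numbers."""
    return BANDWIDTH_AT_BATCH_1_TBPS * 1000.0 / TOKENS_PER_SECOND


def required_bandwidth_tbps(batch_size: int) -> float:
    """Weight-read bandwidth needed when one pass yields `batch_size` tokens.

    Assumption: each pass reads the weights once and produces one token
    per session, so the number of passes per second drops by `batch_size`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return BANDWIDTH_AT_BATCH_1_TBPS / batch_size


if __name__ == "__main__":
    print(f"implied weight traffic per pass: {weight_traffic_per_pass_gb():.1f} GB")
    for batch in (1, 8, 20):
        print(f"batch {batch:>2}: {required_bandwidth_tbps(batch):.1f} TB/s")
```

Under those assumptions, a batch of 8 needs about 49 TB/s and a batch of 20 about 19.6 TB/s. If the 969 tok/s were per session rather than aggregate, the number of forward passes per second would not drop and the batch-1 figure would still apply.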

1 comment

They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846