Hacker News new | ask | show | jobs
by ryao 581 days ago
Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.
1 comments

8 H100 could achieve lot more than 1000 token/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.