by zargon 373 days ago
> I feel that the GPU is still the bottleneck here, not the bus performance.

PCIe bus performance is basically irrelevant.

> Token generation is completely on the card using the memory on the card, without any bus IO at all, no?

Right. But the GPU can't instantaneously access data in VRAM. It has to be copied from VRAM into the GPU's registers first. For every token, essentially all of the model weights held in VRAM have to be read into the GPU to be computed. It's a memory-bound process.

Right now there's roughly a 6x difference in memory bandwidth between low-end and high-end consumer cards (e.g., 4060 Ti at about 288 GB/s vs 5090 at about 1.8 TB/s). Moving up to a B200 (about 8 TB/s) more than quadruples that again.
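
To put rough numbers on the memory-bound argument: if every token needs one full read of the weights, then memory bandwidth divided by model size gives a ceiling on tokens per second. A minimal sketch, assuming approximate spec-sheet bandwidths and a hypothetical 8 GB model (the function name and the model size are mine, purely for illustration). Real throughput will come in lower, since this ignores KV cache reads, compute time, and batching.

```python
# Rough ceiling on single-stream decode speed for a memory-bound model:
# each generated token requires one full pass over the weights in VRAM.

# Approximate spec-sheet memory bandwidths in GB/s (assumed values).
BANDWIDTH_GBPS = {
    "RTX 4060 Ti": 288,
    "RTX 5090": 1792,
    "B200": 8000,
}


def max_tokens_per_second(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if reading the weights is the only cost."""
    return bandwidth_gbps / model_size_gb


# Hypothetical model: about 8B parameters at 8 bits per weight.
MODEL_SIZE_GB = 8.0

for card, bandwidth in BANDWIDTH_GBPS.items():
    ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{card}: at most ~{ceiling:.0f} tokens/s")
```

The ceiling scales linearly with bandwidth and inversely with model size, which is why quantizing the weights or moving to a card with faster memory helps generation speed far more than a faster PCIe link would.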