| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by 3abiton 650 days ago
	This is still amazing work, imagine running chungus models on a single 3090.

1 comments

Sohcahtoa82 649 days ago

The bottleneck on a consumer-grade GPU like a 3090 isn't the processing power, it's the lack of RAM. The PCI-Express bus ends up being your bottleneck from having to swap in parts of the model.

Even with PCIe 5.0 and 16 lanes, you only get 64 GB/s of bandwidth. If you're trying to run a model too big for your GPU, then for every token, it has to reload the entire model. With a 70B parameter model, 8 bit quantization, you're looking at just under 1 token/sec just from having to transfer parts of the model in constantly. Making the actual computation faster won't make it any faster.

link

dragonwriter 649 days ago

OTOH, doesn't it also mean that (given appropriate software framework support) iGPUs with less processing capacity and slower-but-more RAM available (because system RAM is comparatively cheap and plentiful compared to VRAM) without swapping anything are more competitive against consumer dGPUs with fast-but-small RAM for both inference and training with larger models?

link

Sohcahtoa82 649 days ago

System memory isn't that fast, either. Even with DDR5-8400, the fastest memory you can get right now, you're only looking at a memory transfer speed of 67.2 GB/s, barely faster than the PCI-E bus. So even if you could store that entire 70B model in RAM, you're still getting just under 1 token/sec, and that's assuming your CPU doesn't become a bottleneck.

Your best bet would likely be a laptop that has integrated system RAM with VRAM, but I don't think any of those offer enough RAM to store an entire 70B model. A 7B parameter model would work fine, but you could do those on a consumer-grade GPU anyways.

link

DeveloperErrata 649 days ago

Macbook Pros with M3 & integrated RAM & VRAM can do 70B models :)

link