HN Mirror

	by coolspot 281 days ago
	Yes, but you don’t know which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each new token.

1 comment

The latest SSDs benchmark at 3 GB/s and up. The marginal latency would be trivial compared to the inference time.
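
For a rough sense of scale, here is a back-of-the-envelope sketch of the load time being discussed. The 3B active parameters and the 3 GB/s figure come from the thread. The bytes-per-parameter values and the `load_seconds` helper are assumptions for illustration, and the calculation takes the worst case where every active weight has to be re-read from the SSD for each token, with nothing already cached in RAM or VRAM.

```python
def load_seconds(active_params: float, bytes_per_param: float,
                 bandwidth_bytes_per_s: float) -> float:
    """Time to stream the active parameters once at a sustained bandwidth."""
    return active_params * bytes_per_param / bandwidth_bytes_per_s


ACTIVE_PARAMS = 3e9   # from the thread: roughly 3B parameters needed per token
SSD_BANDWIDTH = 3e9   # from the thread: 3 GB/s sequential read

# Assumed weight precisions (not stated in the thread).
PRECISIONS = [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]

for label, bytes_per_param in PRECISIONS:
    seconds = load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_BANDWIDTH)
    print(f"{label}: {seconds:.2f} s per token if all active weights are re-read")
```

Under these assumptions the transfer alone takes 2.00 s, 1.00 s, and 0.50 s per token respectively. How much of that cost shows up in practice depends on how often the required 3B actually changes between tokens and how much is already resident in memory, which the thread does not quantify.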