Hacker News
by coolspot 281 days ago
Yes, but you don’t know in advance which 3B parameters you will need, so you either have to keep all 80B in VRAM or wait until the correct 3B are loaded from NVMe -> RAM -> VRAM. And of course it could be a different 3B for each token.
1 comment

The latest SSDs benchmark at 3 GB/s and up, so the marginal latency would be trivial compared to the inference time.
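
For a rough sense of scale, here is a back-of-the-envelope sketch using the two numbers from the thread (3B active parameters, 3 GB/s SSD throughput). The bytes-per-parameter values and the function name are my own assumptions, and it only covers the worst case where none of the needed parameters are already in RAM or VRAM.

```python
def worst_case_load_seconds(active_params: float,
                            bytes_per_param: float,
                            ssd_gb_per_s: float) -> float:
    """Seconds to stream the active parameters from SSD, assuming none are cached."""
    total_bytes = active_params * bytes_per_param
    return total_bytes / (ssd_gb_per_s * 1e9)


ACTIVE_PARAMS = 3e9   # from the thread: 3B parameters active per token
SSD_GB_PER_S = 3.0    # from the thread: "3 GB/s and up"

# Assumed precisions (not stated in the thread).
for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    secs = worst_case_load_seconds(ACTIVE_PARAMS, bytes_per_param, SSD_GB_PER_S)
    print(f"{label}: {secs:.2f} s per token")
```

Plug in your own drive speed and compare the result against your measured per-token inference time. How close real workloads get to this worst case depends on how often the selected parameters change between tokens and how much of the model stays resident in RAM.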