Hacker News new | ask | show | jobs
by freeqaz 781 days ago
So this is 8x faster for serving these models than before? Or is this about it being more deterministic? I can't quite tell from reading it.
1 comments

The idea is to serve models that would normally be considered too large for GPU memory (70 billion parameters at 16 bytes each for 140 GB of memory required). Some people figured out you can offload the model and only have parts of it loaded so a 24 GB GPU like the 4090 can still serve the model, but it goes a lot slower. They have a new way to serve the same model on the same GPU but 8x better throughput. Something about decoding tokens on a smaller model maybe, then just checking multiple tokens on the larger model in a single batch. Magic, but ultimately its the same model, same GPU, same output as before, but much better throughout.