The idea is to serve models that would normally be considered too large for GPU memory (70 billion parameters at 16 bits, i.e. 2 bytes, each comes to about 140 GB of memory). Some people figured out you can offload the model and keep only parts of it loaded at a time, so a 24 GB GPU like the 4090 can still serve it, but it runs a lot slower. They have a new way to serve the same model on the same GPU with 8x better throughput. Something about decoding tokens on a smaller model, maybe, then checking multiple tokens on the larger model in a single batch. Magic, but ultimately it's the same model, same GPU, same output as before, just with much better throughput.
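If the guess above is right, that sounds like what is usually called speculative decoding. Here is a toy sketch of the greedy version, not their actual implementation. Everything in it is made up for illustration: `target_next` and `draft_next` are hypothetical stand-ins for the large and small models, and the `k + 1` target calls per round stand in for what would be one batched forward pass on a real GPU.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function: context tokens -> next token (greedy).
NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model."""
    return (ctx[-1] * 3 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small model: agrees with the target
    except when the context length is a multiple of 3."""
    guess = target_next(ctx)
    return (guess + 1) % 10 if len(ctx) % 3 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of target verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores every proposed position. On real hardware
        #    this would be a single batched pass; the toy just loops.
        rounds += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix where draft and target agree.
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        tokens.extend(proposal[:n])
        # 4. Always take one token from the target itself: a correction
        #    if the draft was wrong, a free extra token if it was right.
        tokens.append(verified[n])
    return tokens[:goal], rounds


prompt = [1]
baseline = greedy_decode(target_next, prompt, 20)
spec_tokens, target_rounds = speculative_decode(target_next, draft_next, prompt, 20)
print(spec_tokens == baseline, target_rounds)
```

Every token that ends up in the output is one the target itself would have picked, which is why the result matches plain greedy decoding while the big model is consulted in fewer rounds. With sampling instead of greedy decoding, real systems use an accept/reject rule on the two models' probabilities, which this sketch does not cover.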