by wongarsu
5 days ago
The tradeoff gets better the bigger your primary model is, and probably with bigger batch sizes too. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model. For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
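To put rough numbers on that, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the 1B predictor size, the layer/head shape, the 8192-token context, and fp16 storage are made-up values, not measurements of any real setup.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def predictor_overhead(primary_params, predictor_params):
    """Predictor parameter count as a fraction of the primary model's."""
    return predictor_params / primary_params


GIB = 1024 ** 3

# Hypothetical predictor size (assumption, not from the thread).
PREDICTOR_PARAMS = 10 ** 9

# Hypothetical 8B-class shape: 32 layers, 8 KV heads, head_dim 128, fp16.
small_single = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)
small_batched = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16)

print(f"KV cache, 1 request:   {small_single / GIB:.1f} GiB")
print(f"KV cache, 16 requests: {small_batched / GIB:.1f} GiB")
print(f"predictor vs 8B model: {predictor_overhead(8 * 10**9, PREDICTOR_PARAMS):.1%}")
print(f"predictor vs 1T model: {predictor_overhead(10**12, PREDICTOR_PARAMS):.1%}")
```

The KV cache grows linearly with the number of concurrent requests while the predictor's weights are a fixed cost, so the predictor's share shrinks as either the primary model or the batch gets bigger.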