| HN Mirror



	by wongarsu 52 days ago
	The tradeoff gets better the bigger your primary model, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model. For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
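A rough back-of-the-envelope sketch of that scaling, in Python. The KV cache formula (keys plus values, per layer, per KV head, per token, per request) is the standard one; every model shape below, the 32k context length, and the 1B-parameter predictor are hypothetical numbers picked for illustration, not figures from the thread.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Size of an uncompressed KV cache in bytes (fp16/bf16 by default)."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


def weight_bytes(params, bytes_per_param=2):
    """Memory for model weights, assuming 2 bytes per parameter."""
    return int(params * bytes_per_param)


# Hypothetical predictor model: 1B parameters (assumption).
predictor = weight_bytes(1e9)

# Hypothetical 8B-class model, single request, 32k context (assumption).
small_kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          seq_len=32_768, batch=1)

# Hypothetical 1T-class model, 16 concurrent requests, 32k context (assumption).
large_kv = kv_cache_bytes(layers=120, kv_heads=16, head_dim=128,
                          seq_len=32_768, batch=16)

for name, kv in [("8B, batch 1", small_kv), ("1T, batch 16", large_kv)]:
    print(f"{name}: KV cache {kv / 2**30:.0f} GiB, "
          f"predictor weights are {predictor / kv:.1%} of that")
```

Under these assumed shapes the predictor's weights are a large fraction of the cache in the small case and well under one percent in the large case, which is the direction of the argument above. Compute cost is not modeled here.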

1 comment

0-_-0 52 days ago

This can't be used to save VRAM in practice. To generate a new token with the primary model, you first need to decompress the cache, which involves regenerating the whole sequence from scratch. I.e., you generate 1 million tokens with the small model in order to generate 1 with the large one.
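Taking that claim at face value (every new primary-model token requires replaying the entire sequence so far through the small model), the cost can be counted like this. The function name and the replay model are illustrative assumptions, not a description of any actual implementation.

```python
def small_model_steps(context_len, new_tokens):
    """Small-model token steps needed if the full sequence is regenerated
    before each new token of the primary model (assumed cost model)."""
    # Before new token t, the sequence already holds context_len + t tokens.
    return sum(context_len + t for t in range(new_tokens))


# One new token on a 1M-token context costs a full 1M-token replay.
print(small_model_steps(1_000_000, 1))

# The total grows roughly as context_len * new_tokens for longer outputs.
print(small_model_steps(1_000_000, 100))
```

Caching the decompressed state between tokens would avoid the replay, but then the VRAM is occupied again, which is the objection being made.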
