| HN Mirror

	by killerstorm 8 days ago
	While prefill is bottlenecked by GPU compute, decode might be bottlenecked by GPU memory bandwidth, since you basically need to read the entire KV cache for each new token. So compression can make decode faster: you spend more GPU compute on attention, but you need less memory bandwidth.
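A rough back-of-envelope sketch of the point above, in Python. The model shape, the 1000 GB/s bandwidth figure, and both helper names are hypothetical and chosen only for illustration. It also assumes decode time for attention is dominated by reading the KV cache once per token, and it ignores any per-block scale or zero-point overhead that a real quantized cache would carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes of KV cache that must be read to decode one new token."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def decode_read_time_ms(cache_bytes, bandwidth_gb_s):
    """Lower bound on per-token time if only memory bandwidth matters."""
    return cache_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Hypothetical model shape and context length (assumptions, not from the thread).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)

fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)    # 16-bit cache
int4_bytes = kv_cache_bytes(**cfg, bytes_per_elem=0.5)  # 4-bit compressed cache

bandwidth = 1000  # GB/s, an assumed round number for a datacenter GPU

for name, size in [("fp16", fp16_bytes), ("int4", int4_bytes)]:
    ms = decode_read_time_ms(size, bandwidth)
    print(f"{name}: {size / 2**30:.2f} GiB read per token, >= {ms:.2f} ms")
```

Under these assumptions the compressed cache moves a quarter of the bytes per token, so the bandwidth-bound floor on decode time drops by the same factor. Whether that turns into a real speedup depends on how cheap the decompression is compared to the bandwidth saved.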