| HN Mirror



	by yorwba 403 days ago
	It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that uses only a little more memory, since the cache stays the same size and only the parameter count grows.
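
To make the "equivalent" part concrete, here is a rough sketch in plain Python of how I read the claim, not code from the paper. It only covers keys, ignores positional encodings, and all names and sizes (`w_k`, `w_up`, `replication_matrix`, 4 query heads sharing 2 KV heads) are made up for illustration. The assumption is that repeating each KV head across its group is the same as multiplying the cached keys by a fixed 0/1 up-projection, which could then be made trainable.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def replication_matrix(n_kv_heads, n_heads, d_head):
    """0/1 matrix that copies each KV head to every query head in its group."""
    group = n_heads // n_kv_heads
    rows, cols = n_kv_heads * d_head, n_heads * d_head
    r = [[0.0] * cols for _ in range(rows)]
    for h in range(n_heads):
        kv = h // group  # KV head shared by query head h
        for i in range(d_head):
            r[kv * d_head + i][h * d_head + i] = 1.0
    return r

random.seed(0)
# Hypothetical toy sizes, chosen only to keep the example small.
seq_len, d_model, n_heads, n_kv_heads, d_head = 3, 8, 4, 2, 2

x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_k = [[random.gauss(0, 1) for _ in range(n_kv_heads * d_head)]
       for _ in range(d_model)]

# GQA: cache the small keys, then repeat each KV head across its group.
k_cached = matmul(x, w_k)
group = n_heads // n_kv_heads
k_gqa = [
    [row[(h // group) * d_head + i] for h in range(n_heads) for i in range(d_head)]
    for row in k_cached
]

# MLA-style view: the same cached vectors act as the latent, followed by an
# up-projection. Initialized as the replication matrix it reproduces GQA.
w_up = replication_matrix(n_kv_heads, n_heads, d_head)
k_mla = matmul(k_cached, w_up)

same = all(abs(a - b) < 1e-12
           for ra, rb in zip(k_gqa, k_mla) for a, b in zip(ra, rb))
print("keys match:", same)
print("cached width:", len(k_cached[0]), "effective key width:", len(k_mla[0]))
print("extra trainable entries in w_up:", len(w_up) * len(w_up[0]))
```

In this toy setup the cache holds the same vectors either way; the only thing that grows is the `w_up` matrix, which is where the extra trainable parameters would come from.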

1 comment

Thanks for the clarification.