| HN Mirror



	by kavalg 444 days ago
	My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine-tune it. It should make inference faster.

2 comments

yorwba 444 days ago

It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
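To make the "equivalent" part concrete, here is a minimal sketch of one way to read it, not code from the TransMLA authors. In GQA each cached key head is copied to every query head in its group; that copy can be written as multiplying the cached keys by a fixed 0/1 matrix, which then plays the role of an MLA-style up-projection. All names here (`replication_matrix`, `gqa_expand`, `w_up`) and the toy dimensions are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gqa_expand(cached, n_kv_heads, group_size, head_dim):
    """GQA view: repeat each cached key head for every query head in its group."""
    out = []
    for row in cached:
        full = []
        for kv in range(n_kv_heads):
            head = row[kv * head_dim:(kv + 1) * head_dim]
            full.extend(head * group_size)  # list repetition copies the head
        out.append(full)
    return out


def replication_matrix(n_kv_heads, group_size, head_dim):
    """0/1 up-projection that performs the same copying as gqa_expand."""
    latent_dim = n_kv_heads * head_dim
    full_dim = n_kv_heads * group_size * head_dim
    r = [[0.0] * full_dim for _ in range(latent_dim)]
    for kv in range(n_kv_heads):
        for g in range(group_size):
            q_head = kv * group_size + g
            for d in range(head_dim):
                r[kv * head_dim + d][q_head * head_dim + d] = 1.0
    return r


# Toy sizes (hypothetical): 2 KV heads, 3 query heads per group, head dim 4.
random.seed(0)
n_kv_heads, group_size, head_dim, d_model, n_tokens = 2, 3, 4, 8, 5
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
w_k = [[random.gauss(0, 1) for _ in range(n_kv_heads * head_dim)]
       for _ in range(d_model)]

cached = matmul(x, w_k)  # what the KV cache stores per token
k_gqa = gqa_expand(cached, n_kv_heads, group_size, head_dim)

w_up = replication_matrix(n_kv_heads, group_size, head_dim)
k_mla = matmul(cached, w_up)  # latent-attention view of the same keys

max_diff = max(abs(a - b) for ra, rb in zip(k_gqa, k_mla) for a, b in zip(ra, rb))
extra_params = len(w_up) * len(w_up[0])  # entries of w_up that could be trained
```

Both views read the same cached vector per token, so the cache size is unchanged. The difference is that `w_up` starts out as a pure copy operation but is an ordinary matrix, so further training can move it away from 0/1 values; that is where the extra trainable parameters come from.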


kavalg 443 days ago

Thanks for the clarification.


freeqaz 444 days ago

It also makes models smarter (more "expressive").
