by kavalg 397 days ago
My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine-tune it. It should make inference faster.

It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
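To make the equivalence concrete, here is a minimal NumPy sketch of the key path only. It is an illustration under my own assumptions, not the paper's actual procedure: the dimensions and all names (`W_k`, `K_cache`, `R`) are hypothetical, and positional encodings such as RoPE are ignored. The idea is that GQA's "repeat each cached key across its group of query heads" step is the same as multiplying the cache by a fixed 0/1 up-projection matrix, which is exactly the shape of an MLA layer whose latent is the GQA cache.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, chosen only for illustration.
d_model, n_heads, n_kv_heads, d_head, seq = 16, 4, 2, 8, 5
group = n_heads // n_kv_heads  # query heads sharing one KV head

X = rng.standard_normal((seq, d_model))                    # token activations
W_k = rng.standard_normal((d_model, n_kv_heads * d_head))  # GQA key projection

# GQA: cache one key per KV head, then repeat it for every query head in the group.
K_cache = X @ W_k                                          # (seq, n_kv_heads * d_head)
K_gqa = np.repeat(
    K_cache.reshape(seq, n_kv_heads, d_head), group, axis=1
).reshape(seq, n_heads * d_head)

# MLA view: treat the same cache as the latent and express the repetition
# as an explicit up-projection matrix R made of identity blocks.
R = np.kron(np.repeat(np.eye(n_kv_heads), group, axis=1), np.eye(d_head))
K_mla = K_cache @ R                                        # (seq, n_heads * d_head)

print(np.allclose(K_gqa, K_mla))   # both views produce the same per-head keys
print(K_cache.shape, R.shape)      # cache size is unchanged; R holds the extra weights
```

The cache (`K_cache`) has the same size in both views. What changes is that `R` starts out as a fixed replication pattern and, once it is made trainable, every query head can get its own key vector derived from the shared latent. That is where the extra parameters and the extra expressiveness come from.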
Thanks for the clarification.
It also makes models more expressive ("smarter").