by verdverm 515 days ago
The more meaningful contribution may be (section 3.4)

> These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (R_Q, R_K, R_V), TPA unifies multiple existing attention mechanisms—such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
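
For a rough feel of the cache claim in that quote, here is a back-of-the-envelope sketch of how many numbers get cached per token per layer. The MHA, MQA, and GQA counts are the standard ones. The TPA count of (R_K + R_V)(h + d_h) is my reading of the paper's factorized keys and values, and the sizes used (32 heads, head dimension 128, 8 groups, ranks of 2) are hypothetical, so treat this as an assumption-laden sketch rather than the authors' numbers.

```python
def kv_cache_per_token(h, d_h, groups=None, r_k=None, r_v=None):
    """Cached values per token per layer for several attention variants.

    h: number of heads, d_h: head dimension.
    groups: number of KV groups for GQA (optional).
    r_k, r_v: key/value ranks for TPA (optional; formula is an assumption).
    """
    sizes = {
        "MHA": 2 * h * d_h,  # one key and one value vector per head
        "MQA": 2 * d_h,      # a single key/value pair shared by all heads
    }
    if groups is not None:
        sizes["GQA"] = 2 * groups * d_h  # one key/value pair per group
    if r_k is not None and r_v is not None:
        # Assumed: each rank stores a head-side factor (h) and a token-side factor (d_h).
        sizes["TPA"] = (r_k + r_v) * (h + d_h)
    return sizes


# Hypothetical model sizes, chosen only for illustration.
sizes = kv_cache_per_token(h=32, d_h=128, groups=8, r_k=2, r_v=2)
for name, n in sizes.items():
    print(f"{name}: {n} values per token ({sizes['MHA'] / n:.1f}x smaller than MHA)")
```

With these made-up sizes the TPA entry comes out roughly an order of magnitude below MHA, which matches the spirit of the quote, but the ratio depends entirely on the ranks and head counts you pick.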

re: the title, it might be the true one if their proofs hold up

---

I'm now curious whether the "Element-wise Attention is All You Need" preprint can be fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it was only tested with a smaller model.

https://arxiv.org/abs/2501.05730

2 comments

EA doesn't quite fit under the same umbrella. EA has a constant cache size (it's just another classical recurrent architecture inspired by approximating transformers), whereas this paper gives speedups to a variety of true attention mechanisms, which still require caches proportional to the sequence length.
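
To make that distinction concrete, here is a toy sketch: an attention-style cache grows linearly with the number of tokens decoded, while a recurrent-style state stays the same size. The function names and the per-token and state sizes are made-up placeholders, not figures from either paper.

```python
def attention_cache_size(seq_len, per_token=640):
    """Cache for true attention: one entry per past token (hypothetical per-token size)."""
    return seq_len * per_token


def recurrent_state_size(seq_len, state=4096):
    """Recurrent-style state: fixed size no matter how long the sequence is (hypothetical)."""
    return state


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: attention cache = {attention_cache_size(n):>10}, "
          f"recurrent state = {recurrent_state_size(n)}")
```

A smaller per-token factor shifts the attention line down, but it stays a line; the recurrent state is flat from the start.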
very succinct and insightful, thank you!
Curious to know what mathematics you are comfortable with. If you are able to understand the papers you mentioned, you must be in the 99th percentile.
I was never good at proof writing. I found group theory and algebra interesting; topology and analysis eluded me. It's just been a while since I did any serious math thinking.