by bbcc90
513 days ago
(trying to move the critique beyond the title...) When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt:
a) increased memory footprint from the longer KV cache
b) slower decoding (higher per-token latency) due to the longer context window.
This paper addresses a) only, which is useful, but we are still left with b) (right?)
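To put rough numbers on a), here is a back-of-the-envelope sketch of how a conventional KV cache grows with context length. The function name `kv_cache_bytes` and the model configuration (32 layers, head dimension 128, fp16) are assumptions chosen for illustration, not figures from the paper, and the formula covers standard MHA/GQA/MQA caching only, not TPA's factorized cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for standard attention.

    Hypothetical helper for illustration; covers MHA, GQA and MQA,
    which differ only in the number of key/value heads that are cached.
    """
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 1024 ** 3

# Assumed config: 32 layers, head_dim 128, fp16 (2 bytes), 32768-token context.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(n_layers=32, n_kv_heads=kv_heads,
                          head_dim=128, seq_len=32768)
    print(f"{name}: {size / GIB:.2f} GiB")
```

Under these assumed numbers the cache is 16 GiB for MHA, 4 GiB for GQA with 8 KV heads, and 0.5 GiB for MQA, all growing linearly with `seq_len`. Shrinking the cache helps with memory, but each decode step still has to attend over every cached position, which is the b) concern.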
> These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (RQ, RK, RV ), TPA unifies multiple existing attention mechanisms— such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
Re: the title, it might turn out to be true if their proofs hold up.
---
I'm now curious whether the "Element-wise Attention is All You Need" preprint can be fit into this framework. Sadly, my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it has only been tested with a smaller model.
https://arxiv.org/abs/2501.05730