by bbcc90 513 days ago
(trying to move the critique beyond the title...)

When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) the increased memory footprint of a longer KV cache, and b) slower decoding due to the longer context window. This paper addresses a) only, which is useful, but we are still left with b) (right?)


The more meaningful contribution may be (section 3.4)

> These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (RQ, RK, RV ), TPA unifies multiple existing attention mechanisms— such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
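To put rough numbers on the "order of magnitude" claim in that passage, here is a minimal sketch of per-token KV cache sizes. The factorized layout (each rank-1 factor stored as one vector over heads plus one vector over the head dimension) is my reading of the quoted section, and the head count, head dimension, and ranks are illustrative assumptions, not values from the paper.

```python
def mha_entries(num_heads: int, head_dim: int) -> int:
    """Values cached per token per layer when every head has its own K and V."""
    return 2 * num_heads * head_dim


def gqa_entries(num_kv_groups: int, head_dim: int) -> int:
    """Heads within a group share K/V; MQA is the special case of one group."""
    return 2 * num_kv_groups * head_dim


def factorized_entries(num_heads: int, head_dim: int, rank_k: int, rank_v: int) -> int:
    """Assumed layout: each rank-1 factor caches a length-num_heads vector
    and a length-head_dim vector, separately for K and V."""
    return (rank_k + rank_v) * (num_heads + head_dim)


if __name__ == "__main__":
    h, d = 32, 128  # hypothetical model shape
    baseline = mha_entries(h, d)
    for name, n in [
        ("MHA", baseline),
        ("GQA (8 groups)", gqa_entries(8, d)),
        ("MQA", gqa_entries(1, d)),
        ("factorized (rank 2/2)", factorized_entries(h, d, 2, 2)),
    ]:
        print(f"{name:24s} {n:6d} values/token  ({baseline / n:.1f}x smaller than MHA)")
```

With these made-up numbers the factorized cache is about 12.8x smaller than MHA, which is at least in line with the quoted claim. Note that MQA is smaller still in this toy comparison, so the size reduction alone does not tell you about quality.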

Re: the title, it might even turn out to be true if their proofs hold up.

---

I'm now curious whether the "Element-wise Attention is All You Need" preprint can be fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it was only tested with a smaller model.

https://arxiv.org/abs/2501.05730

EA doesn't quite fit under the same umbrella. EA has a constant cache size (it's just another classical recurrent architecture inspired by approximating transformers), whereas this paper gives speedups to a variety of true attention mechanisms, which still require caches proportional to the sequence length.
very succinct and insightful, thank you!
Curious to know what mathematics you are comfortable with. If you are able to understand the papers you mentioned, you must be in the 99th percentile.
I was never good at proof writing. I found group theory and algebra interesting, topology and analysis eluded me. It's just been a while since I did any serious math thinking
It addresses b too since decompositions are always smaller than the original tensor. It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.
> It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

I haven't read this paper (yet), but doesn't that mostly apply to training and not so much to inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but that's mostly irrelevant in inference workloads.

They claim inference-time savings in the KV cache.
I skimmed through the paper real quickly. There's no performance data on inference speedups in the paper. Only the benchmarks relevant for training.

They also, interestingly, don't compare against flash-attention. Flash-attention outperforms all of the other attention mechanisms mentioned in the paper: MHA, MQA, GQA, and MLA.

Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.
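To illustrate the "implementation trick" point: a minimal pure-Python sketch, for a single query vector, showing that attention computed block by block with a running (online) softmax gives the same output as the naive version. This is only the core numerical idea, not the actual flash-attention kernel; the function names and block size are made up for illustration, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materialize all scores, softmax them, then take the weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time and only
    a running max, running normalizer, and running output are kept."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [dot(q, k) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Since the math is identical, the same tiling can sit underneath MHA, MQA, or GQA; which heads share keys and values is a separate choice.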
They aren't claiming speedups, they are claiming up to an order of magnitude less space needed for the kv cache at runtime. This translates to a smaller GPU or longer sequences in the same GPU
Under what circumstances can you cut your loads from and stores to main memory by an order of magnitude without observing major improvements in the runtime of a memory-bound algorithm?
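A back-of-the-envelope way to reason about this is a roofline-style estimate: a step takes roughly the larger of its memory time and its compute time. The sketch below uses entirely hypothetical hardware and workload numbers (none of them come from the paper), and it models only the attention cache traffic, not weight reads.

```python
def step_time(bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s):
    """Roofline-style estimate: the slower of memory traffic and compute wins.

    Returns (seconds, which resource is the bottleneck).
    """
    memory_time = bytes_moved / bandwidth_bytes_per_s
    compute_time = flops / peak_flops_per_s
    bound = "memory" if memory_time >= compute_time else "compute"
    return max(memory_time, compute_time), bound


if __name__ == "__main__":
    BANDWIDTH = 1e12   # hypothetical: 1 TB/s memory bandwidth
    PEAK_FLOPS = 1e14  # hypothetical: 100 TFLOP/s

    # Hypothetical decode step: read 8 GB of KV cache, do 8 GFLOP of work.
    print(step_time(8e9, 8e9, BANDWIDTH, PEAK_FLOPS))

    # Same step with a 10x smaller cache but 4x more arithmetic to rebuild K/V.
    print(step_time(8e8, 3.2e10, BANDWIDTH, PEAK_FLOPS))
```

In this toy setting the step stays memory-bound and gets about 10x faster even with the extra flops. The estimate stops predicting a speedup once something else dominates, for example weight reads at short context lengths, or once the extra arithmetic pushes the step into the compute-bound regime.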
> (trying to move the critique beyond the title...)

This is kind of a theme on HN now. The top comments are completely beside the point of the article/story/etc.

I know. It is sad. Naming can also be seen as a way of showing respect to a hugely impactful paper if you want to be positive about it.