HN Mirror


	by llm_trw 513 days ago
	It addresses b too, since decompositions are always smaller than the original tensor. It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.
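
To put rough numbers on "smaller than the original tensor", here is a minimal sketch. It assumes a plain rank-r matrix factorization, which may well differ from the decomposition the paper uses; the function names and the 4096 x 4096 shape are hypothetical, chosen only for illustration.

```python
def dense_params(m, n):
    # Elements stored for a full m x n matrix.
    return m * n


def low_rank_params(m, n, r):
    # Elements stored for the factors (m x r) and (r x n).
    return r * (m + n)


def break_even_rank(m, n):
    # Largest integer rank r for which r * (m + n) < m * n,
    # i.e. the factors are strictly smaller than the dense matrix.
    return (m * n - 1) // (m + n)


if __name__ == "__main__":
    m = n = 4096  # hypothetical shape
    print(dense_params(m, n))         # full matrix
    print(low_rank_params(m, n, 64))  # rank-64 factors
    print(break_even_rank(m, n))      # saving holds up to this rank
```

In this simple setting the saving holds whenever the rank stays below m*n / (m + n). Rebuilding the product from the factors costs extra multiply-adds, which is the "burning flops" half of the trade.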

menaerus 513 days ago

> It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

I haven't read this paper (yet), but doesn't this mostly apply to training and not so much to inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but it's mostly irrelevant in inference workloads.

verdverm 513 days ago

They claim inference-time savings in the KV cache.

menaerus 513 days ago

I skimmed through the paper real quick. There's no performance data on inference speedups in the paper, only benchmarks relevant to training.

They also, interestingly, don't compare against flash-attention. Flash-attention outperforms all of the other attention mechanisms mentioned in the paper: MHA, MQA, GQA, and MLA.

apophis-ren 502 days ago

Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.
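
A toy illustration of that point, in pure Python for a single query vector. This is only a sketch of the blockwise online-softmax idea that flash attention builds on, not the actual fused GPU kernel; the function names are made up, and the usual 1/sqrt(d) scaling is left out for brevity. Both functions compute the same attention output, they only differ in how the keys and values are traversed.

```python
import math


def attention_naive(q, keys, values):
    # Materialize every score, then apply a numerically stable softmax.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, block=2):
    # Visit keys/values block by block, keeping only a running max,
    # a running softmax denominator, and a running weighted sum.
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max
        # (math.exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.1, 0.2]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(attention_naive(q, keys, values))
    print(attention_streaming(q, keys, values, block=2))
```

The number of key/value heads (MHA vs. GQA vs. MQA) is a separate choice about which `keys` and `values` get fed in, so the two ideas compose.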

verdverm 513 days ago

They aren't claiming speedups; they are claiming up to an order of magnitude less space needed for the KV cache at runtime. This translates to a smaller GPU, or longer sequences on the same GPU.
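
For a sense of scale, a back-of-the-envelope sketch of KV cache size under standard MHA/GQA accounting. This is not the paper's method; the model shape below (32 layers, 32 heads, head dimension 128, fp16) and the function name are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    gib = 2 ** 30
    # Hypothetical model at a context length of 8192 tokens.
    mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
    gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)
    print(mha / gib, "GiB with 32 KV heads")
    print(gqa / gib, "GiB with 8 KV heads")
```

Because the size is linear in `seq_len`, a cache that is N times smaller per token fits roughly N times more tokens into the same memory budget.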

menaerus 513 days ago

Under what circumstances can you cut your LOADs and STOREs from and to main memory by an order of magnitude without observing major improvements in the runtime of a memory-bound algorithm?
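
The question can be framed with a simple roofline-style estimate. This is a rough sketch: the hardware numbers and workload sizes are hypothetical, and it assumes compute and memory traffic overlap perfectly.

```python
def step_time(flops, bytes_moved, peak_flops, bandwidth):
    # Idealized roofline: runtime is set by whichever resource is slower.
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth
    return max(compute_time, memory_time)


if __name__ == "__main__":
    peak_flops = 1e14  # hypothetical peak FLOP/s
    bandwidth = 1e12   # hypothetical memory bandwidth in bytes/s
    flops = 1e10       # hypothetical work per step

    base = step_time(flops, 1e10, peak_flops, bandwidth)
    ten_x = step_time(flops, 1e9, peak_flops, bandwidth)
    thousand_x = step_time(flops, 1e7, peak_flops, bandwidth)
    print(base / ten_x)       # speedup from 10x less traffic
    print(base / thousand_x)  # speedup from 1000x less traffic
```

Under this model, a memory-bound step speeds up in proportion to the traffic saved until the compute term becomes the bottleneck. A step that is already compute-bound gains nothing from moving fewer bytes.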

verdverm 513 days ago

AI models are compute-bound; that's why we use GPUs.