by llm_trw
513 days ago
It addresses b too, since a decomposition is smaller than the original tensor (as long as the rank is low enough). Memory access is also usually slower than the matrix multiplications themselves, so this should be faster. You're burning flops to save memory movement.
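To put rough numbers on the size claim, here is a minimal sketch assuming the simplest case: a plain rank-r factorization of an m x n matrix into an m x r and an r x n factor. The paper's actual decomposition may differ, and the function names and the 4096 x 4096 shape are made up for illustration.

```python
# Sketch only: assumes W (m x n) is replaced by U (m x r) @ V (r x n).
# All names and sizes here are illustrative, not taken from the paper.

def dense_params(m: int, n: int) -> int:
    """Number of stored elements in the full m x n matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of stored elements in the two factors U and V."""
    return r * (m + n)


def break_even_rank(m: int, n: int) -> int:
    """Largest rank r for which the factors are strictly smaller than W."""
    return (m * n - 1) // (m + n)


m, n, r = 4096, 4096, 64
print(dense_params(m, n))        # 16777216
print(low_rank_params(m, n, r))  # 524288
print(break_even_rank(m, n))     # 2047
```

So under these assumptions the factors only win on storage while r stays below roughly mn / (m + n). Past that rank the decomposition is no longer smaller than the tensor it replaces.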
|
I haven't read this paper (yet), but isn't this something that mostly applies to training and not so much to inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but it's mostly irrelevant in inference workloads.