by menaerus
513 days ago
> It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

I haven't read this paper (yet), but doesn't this mostly apply to training and not so much to inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but that's mostly irrelevant in inference workloads.