by imtringued
560 days ago
This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention. Once you have offloaded flash attention, you're back to GEMV being bottlenecked by memory bandwidth. GEMV does a single multiplication and a single addition per parameter, so every weight has to be fetched from memory for just two FLOPs of work. You can add as many exaFLOPS as you want; it won't get faster than your memory.
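To put rough numbers on that, here is a minimal roofline-style sketch. The function names and all hardware figures (7e9 parameters, 2 bytes per parameter, 100 GB/s, 50 TFLOPS) are made up for illustration, and it assumes every weight is read from memory exactly once per generated token with nothing served from cache.

```python
def gemv_time_bounds(n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops):
    """Lower bounds (seconds) on one pass over the weights: (memory, compute)."""
    flops = 2 * n_params                      # one multiply + one add per parameter
    bytes_moved = n_params * bytes_per_param  # assumption: each weight is read once
    memory_time = bytes_moved / mem_bw_bytes_per_s
    compute_time = flops / peak_flops
    return memory_time, compute_time


def tokens_per_second(n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops):
    """Upper bound on tokens/s: the slower of the two limits wins."""
    memory_time, compute_time = gemv_time_bounds(
        n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops
    )
    return 1.0 / max(memory_time, compute_time)


# Hypothetical device: 7e9 params in fp16, 100 GB/s memory, 50 TFLOPS compute.
baseline = tokens_per_second(7e9, 2, 100e9, 50e12)
# Same memory, 1000x the compute.
more_compute = tokens_per_second(7e9, 2, 100e9, 50e15)
# Same compute, 2x the memory bandwidth.
more_bandwidth = tokens_per_second(7e9, 2, 200e9, 50e12)

print(baseline, more_compute, more_bandwidth)
```

With these invented figures the bound is about 7 tokens/s either way for the first two cases: multiplying compute by 1000 changes nothing, while doubling memory bandwidth doubles the bound.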