by imtringued 560 days ago
This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention.

Once you have offloaded flash attention, you're back to GEMV being memory-bound. GEMV does a single multiplication and addition per parameter, so every weight has to be read from memory for just two FLOPs of work. You can add as many exaFLOPS as you want, but it won't run faster than your memory can deliver the weights.
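
To put rough numbers on that, here is a back-of-the-envelope sketch, not a benchmark. It assumes batch size 1, every weight read exactly once per token, and it ignores activation traffic and caches. The function names and the hardware figures (7B parameters, 2 bytes each, 100 GB/s) are made up for illustration.

```python
def gemv_time_bounds(n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops):
    """Lower bounds (seconds) on one GEMV pass over all weights.

    Returns (memory_time, compute_time). Assumes each weight is read once
    and costs one multiply plus one add.
    """
    bytes_moved = n_params * bytes_per_param   # weights streamed from memory
    flops = 2 * n_params                       # multiply + add per parameter
    return bytes_moved / mem_bw_bytes_per_s, flops / peak_flops


def max_tokens_per_s(n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops):
    """Upper bound on tokens/s: the slower of the two limits wins."""
    mem_t, compute_t = gemv_time_bounds(
        n_params, bytes_per_param, mem_bw_bytes_per_s, peak_flops
    )
    return 1.0 / max(mem_t, compute_t)


# Hypothetical setup: 7B parameters at 2 bytes each, 100 GB/s memory bandwidth.
modest = max_tokens_per_s(7e9, 2, 100e9, peak_flops=1e12)   # 1 TFLOPS
absurd = max_tokens_per_s(7e9, 2, 100e9, peak_flops=1e18)   # 1 exaFLOPS
print(f"{modest:.2f} tok/s vs {absurd:.2f} tok/s")
```

Under these assumptions both configurations hit the same ceiling, because the memory time (14 GB at 100 GB/s) dominates the compute time in either case. Only more bandwidth or fewer bytes per parameter moves the bound.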

1 comment

Out of interest, how does that look for diffusion?