| HN Mirror



	by imtringued 560 days ago
	This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention. Once you have offloaded flash attention, you're back to GEMV being bottlenecked by memory. GEMV does a single multiplication and addition per parameter, so every weight is read from memory once and used once. You can add as many exaFLOPs as you want; it won't get faster than your memory.
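To put rough numbers on that, here is a minimal roofline-style sketch. The function name `gemv_time_lower_bound` and all hardware figures (7e9 parameters, 2 bytes per parameter, 1 TB/s of bandwidth, 100 TFLOP/s vs 100 PFLOP/s) are made up for illustration, and it assumes each weight is read exactly once while ignoring vector traffic and caches.

```python
def gemv_time_lower_bound(num_params, bytes_per_param, peak_flops, mem_bandwidth):
    """Return (bound, compute_time, memory_time) in seconds for one GEMV pass.

    Assumes every weight is read from memory exactly once and that
    input/output vector traffic is negligible.
    """
    flops = 2 * num_params                      # one multiply + one add per parameter
    bytes_moved = num_params * bytes_per_param  # each weight streamed once
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    # The pass cannot finish faster than the slower of the two.
    return max(compute_time, memory_time), compute_time, memory_time


# Hypothetical setup: 7e9 parameters at 2 bytes each, 1 TB/s memory bandwidth.
params = 7e9
slow = gemv_time_lower_bound(params, 2, peak_flops=100e12, mem_bandwidth=1e12)
fast = gemv_time_lower_bound(params, 2, peak_flops=100e15, mem_bandwidth=1e12)

# Arithmetic intensity is fixed by the data type: 2 FLOPs per 2 bytes here.
intensity = 2 / 2  # FLOP per byte

print("bound at 100 TFLOP/s:", slow[0], "s")
print("bound at 100 PFLOP/s:", fast[0], "s")
```

In this sketch both bounds are equal to the memory time: raising peak compute by 1000x shrinks only the compute term, which was never the limiting one.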

1 comment

Out of interest, how does that look for diffusion?