| HN Mirror



	by lxe 708 days ago
	> FlashAttention-3 is optimized for Hopper GPUs (e.g. H100).

	How does FA3 fare on consumer GPUs such as the 3090 and 4090?


apsec112 708 days ago

It's Hopper-specific; the improvements are closely tied to Hopper features like warp groups and TMA. On a 4090, you might get a speedup from the Triton implementation of FP8 attention: https://triton-lang.org/main/getting-started/tutorials/06-fu...


moffkalast 708 days ago

The original flash attention (v1?) took like a year to get added to llama.cpp, and it only provides single-digit-percent VRAM savings at typical context lengths and practically no speed boost. Still nice to have, but man, was this thing overhyped. I doubt v3 will do more than marginally better on the RTX 5000 series.


apsec112 708 days ago

On GPU, or on CPU/Metal? For the latter I'm not surprised, but that's because they have a totally different memory/cache hierarchy.


moffkalast 708 days ago

With CUDA offloading. I don't think it runs at all otherwise.

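For a rough sense of why the VRAM savings discussed above depend so heavily on context length, here is a back-of-the-envelope sketch. It assumes a naive implementation that materializes the full attention score matrix for one layer (heads x query length x key length) in fp16. The function name and the 32-head figure are illustrative assumptions, not taken from llama.cpp or any FlashAttention release, and real engines usually process the prompt in batches, so the buffer they actually allocate may be much smaller.

```python
def score_buffer_bytes(n_heads: int, q_len: int, kv_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes to hold one layer's full attention score matrix (naive attention)."""
    return n_heads * q_len * kv_len * bytes_per_elem


if __name__ == "__main__":
    n_heads = 32  # assumed head count, for illustration only
    for ctx in (2048, 8192, 32768):
        # Hypothetical worst case: the whole prompt is scored in one pass.
        mib = score_buffer_bytes(n_heads, ctx, ctx) / 2**20
        print(f"ctx={ctx:>6}: {mib:,.0f} MiB if the whole prompt is scored at once")
```

The buffer grows quadratically with context length, and that quadratic term is what tiled attention kernels avoid materializing. At short contexts it is small, which would fit with modest savings there, though this sketch is not a measurement of any real engine.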