by f_devd
708 days ago
> How much is the flash attention algorithm tied to the hardware?

The original FA, almost not at all. For the latest versions it depends on your abstraction: ThunderKittens[0] provides about the same speedup over FA2 (1.3x-2x) as the article, while being relatively universal across GPUs. Any new hardware may have hardware-specific features that let it eke out more performance. Vendors will usually adopt any new feature that seems to beat them, but you do end up with fragmented APIs/libraries (which is already true for CUDA).

[0]: https://hazyresearch.stanford.edu/blog/2024-05-12-tk
[0] https://github.com/HazyResearch/ThunderKittens?tab=readme-ov...
[1] https://github.com/vosen/ZLUDA