by f_devd 708 days ago
> How much is the flash attention algorithm tied to the hardware?

The original FA, almost none.
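To make that concrete, here is a minimal sketch of the online-softmax tiling idea in pure Python, for a single query vector. It is not the FA kernel, just the math: keys and values are visited one block at a time, and only a running max, a running normalizer and a running output are kept. The names `attention_naive`, `attention_tiled` and the `block` parameter are made up for illustration.

```python
import math


def attention_naive(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) . V for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_tiled(q, keys, values, block=4):
    """Same result, but keys/values are consumed one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Nothing in there refers to a GPU. The block size only decides how much has to be held at once, which is where the hardware starts to matter.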

For the latest versions, it depends on your abstraction. ThunderKittens[0] provides about the same speedup over FA2 (1.3x-2x) as the article, but is relatively universal across GPUs. Any new hardware may have specific features that let a tuned implementation edge out more performance; usually vendors will adopt any new features that seem to beat them, but you do get fragmented APIs/libraries (which is already true for CUDA).

[0]: https://hazyresearch.stanford.edu/blog/2024-05-12-tk


What do you mean by "relatively universal"? This is CUDA only [0], with a promise of a ROCm backend eventually. There's only one project I'm aware of that seriously tries to address the CUDA issue in ML [1].

[0] https://github.com/HazyResearch/ThunderKittens?tab=readme-ov...

[1] https://github.com/vosen/ZLUDA

If you read the article I linked, they show that it's entirely based on 16x16 matrices (or "tiles"), which is fairly standard across GPUs.
I mean they're building an API to abstract away some of the SKU-to-SKU differences, but the broader point cuts the other way, I think:

> In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big can fit onto an SM. How dense should the compute be? No less so than what the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match.

The value is in adapting the implementation (either manually at write-time or programmatically at run-time) to the specifics of the hardware.
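A toy sketch of the run-time variant, in pure Python: pick the largest tile (a multiple of 16) whose working set fits in some on-chip memory budget, then run a blocked matmul with it. The budget figures, the "three tiles resident at once" rule, and the names `pick_tile` and `matmul_tiled` are assumptions for illustration, not how ThunderKittens or any real library does it.

```python
def pick_tile(on_chip_bytes, bytes_per_elem=2, base=16):
    """Largest multiple of `base` such that three tile x tile blocks
    (an A tile, a B tile and an accumulator) fit in the budget.
    Never returns less than `base`. The budget is a hypothetical number."""
    tile = base
    while 3 * (tile + base) ** 2 * bytes_per_elem <= on_chip_bytes:
        tile += base
    return tile


def matmul_tiled(a, b, tile):
    """Blocked matrix multiply: C = A @ B, visiting tile x tile blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Edge tiles are clipped to the matrix bounds.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


# Example: a hypothetical 48 KiB budget with 2-byte elements.
tile = pick_tile(48 * 1024)
```

The algorithm is identical for every budget; only `tile` changes. On a real GPU the choice also depends on registers, occupancy and the tensor-core instruction shapes, which this ignores.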

Also, great line:

> And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?