by refibrillator
708 days ago
The code has a comment which seems to hint that Tri Dao was working on FA3 as early as April 2022, the month after Hopper/H100 was announced. I find it mildly curious that over 2 years have elapsed before the code was released today. Perhaps it's because there are now better solutions in the pipeline? Tri's publication history has been leaning toward SSM and Mamba style architectures recently. Unlike Flash Attention, which has quadratic time complexity wrt sequence length, these latest algorithms are subquadratic. Thus they do much less computation, instead of just doing the same computation more efficiently a la Flash Attention. Dao and Gu published a really long paper this year which demonstrated (among other things) how Mamba/SSM can be formulated such that it's amenable to acceleration using the same hardware primitives that Transformers benefit from.
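To make the quadratic vs subquadratic point concrete, here is a toy sketch in plain Python. It is not Flash Attention and not Mamba: `softmax_attention` is textbook attention over scalar tokens, and `linear_scan` is a fixed-coefficient linear recurrence standing in for an SSM (real Mamba layers use input-dependent parameters and vector states). The function names and the coefficients `a`, `b`, `c` are made up for illustration. Each function also returns a counter so you can see how the work grows with sequence length.

```python
import math


def softmax_attention(q, k, v):
    """Textbook softmax attention over scalar tokens.

    Returns (outputs, pairs), where pairs counts the query/key
    products computed: n * n for a sequence of length n.
    """
    n = len(q)
    outputs, pairs = [], 0
    for i in range(n):
        scores = []
        for j in range(n):
            scores.append(q[i] * k[j])  # one score per (query, key) pair
            pairs += 1
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * vj for w, vj in zip(weights, v)) / total)
    return outputs, pairs


def linear_scan(x, a=0.9, b=1.0, c=1.0):
    """Toy SSM-style recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Returns (outputs, steps), where steps counts state updates:
    n for a sequence of length n. The coefficients are illustrative.
    """
    h, outputs, steps = 0.0, [], 0
    for xt in x:
        h = a * h + b * xt  # constant work per token
        outputs.append(c * h)
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    for n in (8, 64, 512):
        seq = [0.01 * i for i in range(n)]
        _, pairs = softmax_attention(seq, seq, seq)
        _, steps = linear_scan(seq)
        print(n, pairs, steps)
```

Flash Attention still computes all n² pairs; it just avoids materializing the score matrix in slow memory. The recurrence never forms the pairs at all, which is the sense in which it does less computation.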
As proving SETH would also settle the P vs NP problem (SETH implies P ≠ NP), I wouldn't hold my breath.
The question is whether a particular use case can accept those costs.