by samber
9 days ago
Comparing compute cost against FlashAttention-2 does not seem very honest to me. FlashAttention-2 has not been used for at least 2 years. This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.