|
|
|
|
|
by magnio
850 days ago
|
|
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is stellar clear to me, and the included references and context are a trasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article. One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training. |
|
“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”