by shawntan
816 days ago
You're both kinda right. The computation in the attention step you're referring to is parallel. I would say the thing that is "constant" is the depth of the computation graph (the number of sequential computations), which actually matters for computing certain functions. https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
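To make "number of sequential computations" concrete, here is a rough sketch that counts dependent steps for parity (XOR over all bits). Parity is my choice of illustration, and the function names are made up for this example. The point is only that the count of dependent steps, not the amount of parallel work per step, is what a fixed-depth graph caps.

```python
def parity_sequential(bits):
    """Left-to-right scan: every step depends on the previous one."""
    acc, depth = 0, 0
    for b in bits:
        acc ^= b
        depth += 1  # one more step on the critical path
    return acc, depth


def parity_tree(bits):
    """Pairwise XOR reduction: all pairs in a level are independent."""
    level, depth = list(bits), 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd element carries over to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1  # one level = one parallel step
    return (level[0] if level else 0), depth
```

For 8 input bits the scan takes 8 dependent steps and the tree takes 3 levels. Either way the depth grows with the input length, which is the part a graph with a constant number of sequential steps can't follow.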
Flash attention, which is widely used, is no longer parallel. The attention matrix is computed block by block.
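For anyone curious what "block by block" looks like, here is a minimal pure-Python sketch of the running-softmax trick for a single query. This is not the actual FlashAttention kernel, just the idea under simplifying assumptions (one query, no scaling, no masking), and the function names and `block` parameter are hypothetical. Key/value blocks are visited one after another, but the result matches the all-at-once version, and separate queries don't depend on each other.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(q, keys, values):
    """Softmax attention for one query, using all scores at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def attention_blockwise(q, keys, values, block=2):
    """Same output, but keys/values are consumed one block at a time."""
    m = float("-inf")             # running max of the scores seen so far
    z = 0.0                       # running softmax denominator
    acc = [0.0] * len(values[0])  # running weighted sum of values
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [dot(q, k) for k in kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale earlier partial sums to the new max
        weights = [math.exp(s - m_new) for s in scores]
        z = z * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, vb))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]
```

The loop over key blocks is sequential for a given query, but its length depends on the sequence length and block size, not on anything the model learns.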