by WithinReason
616 days ago
Hmmm, this could be expressed as 2 consecutive attentions in a residual branch. Simplified, the Differential Transformer's attention looks like:

(softmax(Q₁K₁) − λ softmax(Q₂K₂)) V

You can factor this into:

x = softmax(Q₁K₁)V
x += -λ softmax(Q₂K₂)V

which is like 2 subsequent regular attentions added together that share V.
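A quick way to sanity-check the factoring is to compute both forms numerically and compare them. The following is only a minimal NumPy sketch of the simplified formula above, not the paper's implementation: the function names, the shapes, and the value of λ are made up for illustration, Q₁K₁ is read as `q1 @ k1.T`, and the usual 1/√d scaling, masking, and multiple heads are left out.

```python
import numpy as np


def softmax(scores):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def diff_attention_fused(q1, k1, q2, k2, v, lam):
    # (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V, computed in one go.
    a1 = softmax(q1 @ k1.T)
    a2 = softmax(q2 @ k2.T)
    return (a1 - lam * a2) @ v


def diff_attention_two_step(q1, k1, q2, k2, v, lam):
    # Two regular attentions sharing V, the second added residual-style.
    x = softmax(q1 @ k1.T) @ v
    x += -lam * (softmax(q2 @ k2.T) @ v)
    return x


# Hypothetical toy sizes: 5 tokens, head dimension 4.
rng = np.random.default_rng(0)
n_tokens, d_head = 5, 4
q1, k1, q2, k2, v = (rng.standard_normal((n_tokens, d_head)) for _ in range(5))
lam = 0.8  # arbitrary value picked for the check

fused = diff_attention_fused(q1, k1, q2, k2, v, lam)
two_step = diff_attention_two_step(q1, k1, q2, k2, v, lam)
print(np.allclose(fused, two_step))
```

The two results agree up to floating-point rounding because matrix multiplication distributes over the subtraction. Note that both attentions here read the same input, so "consecutive" describes the order of accumulation into x, not one attention feeding the next.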