by PartiallyTyped
984 days ago
I mean... if you think about it, attention changes the effective weights of a model. I am fairly certain that, if you try, you can show that for any particular sequence of N tokens, the first N-1 tokens induce a residual FFNN that, given just the Nth token, produces exactly the same distribution over the next token.
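To make that concrete, here is a rough sketch for the simplest case only: one attention head, one layer, no layer norm, no MLP block. All names (`attention_last`, `induce_ffnn`, `W1`, `W2`) are made up for illustration. The idea is that once the first N-1 tokens are fixed, their keys and values are constants, so they can be folded into a hidden weight matrix `W1` and an output weight matrix `W2`, leaving a residual function of the Nth token alone. Two caveats: the "activation" is a softmax rather than a ReLU, and the Nth token still attends to itself, so this is not quite a textbook FFNN. Whether the same construction carries through a full multi-layer model is not shown here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]


def mix(weights, vectors):
    """Weighted sum of equal-length vectors."""
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))]


def attention_last(X, Wq, Wk, Wv):
    """Ordinary attention: residual output at the last position of X."""
    x = X[-1]
    q = matvec(Wq, x)
    keys = [matvec(Wk, t) for t in X]
    vals = [matvec(Wv, t) for t in X]
    scale = math.sqrt(len(q))
    w = softmax([dot(q, k) / scale for k in keys])
    return [a + b for a, b in zip(x, mix(w, vals))]


def induce_ffnn(context, Wq, Wk, Wv):
    """Bake the first N-1 tokens into fixed weights; return a function of x_N only."""
    d = len(Wq)  # assumes square d x d projections
    scale = math.sqrt(d)
    ctx_keys = [matvec(Wk, t) for t in context]
    # Output weights: one value vector per context token.
    W2 = [matvec(Wv, t) for t in context]
    # Hidden weights: row j is Wq^T k_j / sqrt(d), so W1 @ x gives the scores on the context.
    W1 = [[dot(k, [Wq[i][c] for i in range(d)]) / scale for c in range(d)]
          for k in ctx_keys]

    def ffnn(x):
        # The N-th token also attends to itself; that term depends on x alone.
        self_score = dot(matvec(Wq, x), matvec(Wk, x)) / scale
        w = softmax(matvec(W1, x) + [self_score])
        return [a + b for a, b in zip(x, mix(w, W2 + [matvec(Wv, x)]))]

    return ffnn


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, n = 4, 6  # toy embedding size and sequence length
Wq, Wk, Wv = (rand_matrix(d, d) for _ in range(3))
X = rand_matrix(n, d)  # random stand-ins for token embeddings

full = attention_last(X, Wq, Wk, Wv)
frozen = induce_ffnn(X[:-1], Wq, Wk, Wv)(X[-1])
print(max(abs(a - b) for a, b in zip(full, frozen)))
```

The printed gap should be down at floating-point rounding level. Note that the hidden width of the induced network is the context length, so a longer prompt literally means a wider layer.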