by lostmsu
1221 days ago
The paper speculates that it is analogous to gradient descent and empirically confirms that the behavior is similar, but that is not a rigorous proof of any kind. The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
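To make the momentum point concrete, here is a rough sketch. It assumes the momentum variant amounts to ordinary softmax attention plus an exponentially decayed sum of past value vectors. The function names and the decay parameter `eta` are made up for illustration and are not taken from the paper.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    # Plain scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def ema_of_values(values, eta):
    # Exponentially decayed sum of past value vectors (assumed form):
    # the most recent value gets weight eta, the one before eta**2, etc.
    n = len(values)
    dim = len(values[0])
    return [sum(eta ** (n - i) * values[i][d] for i in range(n)) for d in range(dim)]


def momentum_attention(q, keys, values, eta=0.5):
    # Hypothetical "momentum" variant: attention output plus the decayed sum.
    attn = attention(q, keys, values)
    ema = ema_of_values(values, eta)
    return [a + e for a, e in zip(attn, ema)]
```

In this sketch the added term does not depend on the query or the keys at all. It is a fixed, recency-weighted mix of earlier values, which is why it reads more like extra context being carried forward than like an optimizer's momentum.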
Such is the nature of early theories.