| HN Mirror



	by lostmsu 1224 days ago
	Maybe I am missing something, but I don't see any learning without autodiff.


adamnemecek 1223 days ago

I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2


lostmsu 1221 days ago

The paper speculates that it is analogous to gradient descent and empirically confirms it is similar in behavior, but it is not a rigorous proof of any kind.
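
For reference, the part of the analogy that is exact only holds, as far as I can tell, once softmax is dropped (the "linear attention" relaxation). A minimal numpy sketch of that relaxed case, with made-up shapes and a hypothetical `W0` standing in for the zero-shot weights; none of these names come from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_demo = 4, 3, 5            # arbitrary sizes, chosen for illustration

W0 = rng.normal(size=(d_out, d_in))      # hypothetical "zero-shot" weights
K = rng.normal(size=(n_demo, d_in))      # keys from the in-context demonstrations
V = rng.normal(size=(n_demo, d_out))     # values from the in-context demonstrations
q = rng.normal(size=d_in)                # query token

# Attention view: unnormalized linear attention over the demonstrations,
# i.e. W0 q + sum_i v_i (k_i . q), with no softmax.
attn_out = W0 @ q + V.T @ (K @ q)

# Weight-update view: the sum of outer products v_i k_i^T acts like a
# delta applied to W0, which is the form a gradient step on a linear layer takes.
delta_W = V.T @ K                        # shape (d_out, d_in)
gd_out = (W0 + delta_W) @ q

# The two views agree by associativity of matrix multiplication.
print(np.allclose(attn_out, gd_out))
```

With softmax put back in, the two sides no longer match term by term, which is where the argument stops being an identity and becomes an analogy.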

The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
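
To make "adds past values" concrete, here is one way to read it, as a sketch under my own assumptions and not the paper's implementation: the attention output gets an extra exponentially decayed sum of the value vectors. The function name `attn_with_value_momentum`, the decay parameter `eta`, and the exact indexing of the decay are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())              # subtract max for numerical stability
    return e / e.sum()

def attn(q, K, V):
    # Standard single-query scaled dot-product attention.
    return softmax(K @ q / np.sqrt(q.size)) @ V

def attn_with_value_momentum(q, K, V, eta):
    # Assumed reading of the momentum variant: plain attention plus a
    # decayed sum of value vectors (older positions are weighted less).
    t = V.shape[0]
    decay = eta ** np.arange(t, 0, -1)   # eta^t for the oldest row, eta^1 for the newest
    return attn(q, K, V) + decay @ V

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
q1, q2 = rng.normal(size=4), rng.normal(size=4)

extra1 = attn_with_value_momentum(q1, K, V, eta=0.5) - attn(q1, K, V)
extra2 = attn_with_value_momentum(q2, K, V, eta=0.5) - attn(q2, K, V)
print(np.allclose(extra1, extra2))       # the added term ignores the query
```

Under this reading the extra term does not depend on the query at all; every value vector contributes no matter how little attention weight it receives.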


adamnemecek 1220 days ago

> but it is not a rigorous proof of any kind.

Such is the nature of early theories.
