| HN Mirror


	by LukeB42 1225 days ago
	Can you explain like I'm 5 why this matters, as distinct from how transformers are normally trained with autodiff, and what its possible applications are?

adamnemecek 1225 days ago

I’m talking about attention-only transformers. Those don’t use autodiff but still learn. The math is actually really cool.

lostmsu 1224 days ago

> attention only transformers

Can you share any good link on the subject?

adamnemecek 1224 days ago

https://transformer-circuits.pub/2021/framework/index.html

lostmsu 1224 days ago

Maybe I am missing something, but I don't see any learning without autodiff.

adamnemecek 1223 days ago

I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2

lostmsu 1221 days ago

The paper speculates that it is analogous to gradient descent and empirically confirms it is similar in behavior, but it is not a rigorous proof of any kind.

The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
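
For reference, here is a minimal sketch of the analogy as I understand it, assuming linear attention (no softmax) and treating each value vector as if it were an error signal. Under those assumptions, attending over (key, value) pairs gives the same result as applying an accumulated outer-product weight update, the same form a gradient step on a linear layer takes, to the query. All names below are hypothetical and this is not the paper's code. With softmax in place the identity no longer holds exactly, which is why the correspondence reads as an analogy rather than a proof.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def linear_attention(values, keys, q):
    # Softmax-free attention: sum_i v_i * (k_i . q)
    out = [0.0] * len(values[0])
    for v, k in zip(values, keys):
        score = dot(k, q)
        out = [o + score * vi for o, vi in zip(out, v)]
    return out

def accumulated_update(values, keys):
    # dW = sum_i outer(v_i, k_i); a gradient step on a linear layer has
    # this form when v_i plays the role of the (scaled) error signal
    # and k_i the layer input. That role assignment is an assumption.
    d_out, d_in = len(values[0]), len(keys[0])
    dW = [[0.0] * d_in for _ in range(d_out)]
    for v, k in zip(values, keys):
        for r in range(d_out):
            for c in range(d_in):
                dW[r][c] += v[r] * k[c]
    return dW

rng = random.Random(0)
d_in, d_out, n = 4, 3, 5
keys = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(n)]
q = [rng.gauss(0, 1) for _ in range(d_in)]

attn_out = linear_attention(values, keys, q)
gd_out = matvec(accumulated_update(values, keys), q)

# The two outputs agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(attn_out, gd_out))
```

Note that nothing here is learned by the forward pass itself: the keys and values still come from projections that were trained the usual way.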
