HN Mirror

	by adamnemecek 1225 days ago
	I’m talking about attention-only transformers. Those don’t use autodiff but still learn. The math is actually really cool.

lostmsu 1225 days ago

> attention only transformers

Can you share any good link on the subject?

adamnemecek 1225 days ago

https://transformer-circuits.pub/2021/framework/index.html

lostmsu 1224 days ago

Maybe I am missing something, but I don't see any learning without autodiff.

adamnemecek 1223 days ago

I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2

lostmsu 1221 days ago

The paper speculates that it is analogous to gradient descent and empirically confirms it is similar in behavior, but it is not a rigorous proof of any kind.

The momentum experiment they ran also does not seem related: it just adds past values to V, which extends the effective context length.
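
For anyone following along, the gradient descent analogy rests on a dual form: applying a linear layer after outer-product weight updates gives the same output as the original layer plus unnormalized (softmax-free) linear attention over the training examples. Here is a minimal Python sketch of that identity. The function names, dimensions, and random data are illustrative assumptions, not taken from the paper's code, and the sketch says nothing about how softmax attention in a trained transformer behaves.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gd_updated_output(W0, X, E, q):
    """Apply (W0 + sum_i outer(e_i, x_i)) to the query q."""
    W = [row[:] for row in W0]
    for x, e in zip(X, E):
        for r in range(len(W)):
            for c in range(len(W[r])):
                W[r][c] += e[r] * x[c]
    return matvec(W, q)

def linear_attention_output(W0, X, E, q):
    """Compute W0 q plus unnormalized linear attention (keys X, values E)."""
    scores = [sum(a * b for a, b in zip(x, q)) for x in X]  # x_i . q
    base = matvec(W0, q)
    return [b + sum(s * e[r] for s, e in zip(scores, E))
            for r, b in enumerate(base)]

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_in, d_out, n = 4, 3, 5
W0 = rand_matrix(d_out, d_in, rng)   # initial weights
X = rand_matrix(n, d_in, rng)        # training inputs, playing the role of keys
E = rand_matrix(n, d_out, rng)       # error signals, playing the role of values
q = [rng.gauss(0.0, 1.0) for _ in range(d_in)]  # test input, the query

out_gd = gd_updated_output(W0, X, E, q)
out_attn = linear_attention_output(W0, X, E, q)
max_gap = max(abs(a - b) for a, b in zip(out_gd, out_attn))
print(max_gap < 1e-9)
```

The two outputs agree exactly only because there is no softmax normalization, which is one reason the correspondence reads as an analogy rather than a proof.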

adamnemecek 1220 days ago

> but it is not a rigorous proof of any kind.

Such is the nature of early theories.
