	by adamnemecek 1232 days ago
	It turns out that transformers have a learning mechanism similar to autodiff, but better, since it happens mostly within individual layers rather than over the whole graph. I recently wrote a paper on this: https://arxiv.org/abs/2302.01834v1 (the math is crazy).

4 comments

tpoacher 1232 days ago

"Combinatorial Hopf" would make an excellent beer name!

"Bartender! A half-pint of your finest Combinatorial Hopf, if you please!"

LukeB42 1232 days ago

Can you explain like I'm 5 why this matters distinctly from how transformers are normally trained with autodiff and what its possible applications are?

adamnemecek 1232 days ago

I’m talking about attention-only transformers. Those don’t use autodiff but still learn. The math is actually really cool.

lostmsu 1231 days ago

> attention-only transformers

Can you share any good link on the subject?

adamnemecek 1231 days ago

https://transformer-circuits.pub/2021/framework/index.html

lostmsu 1230 days ago

Maybe I am missing something, but I don't see any learning without autodiff.

adamnemecek 1230 days ago

I thought you were asking about attention-only transformers. This paper touches on some of it: https://arxiv.org/abs/2212.10559v2

macrolocal 1232 days ago

First question: why should the attention mechanism output and residual stream match?

adamnemecek 1232 days ago

"Match" is the wrong word: they don’t match, they are duals. The residual stream, aka the identity mapping, needs to be the identity of the attention mechanism as the attention mechanism learns.

But this is the same for all residual streams, not just those in transformers.

Join my Discord to discuss this further: https://discord.gg/mr9TAhpyBW

macrolocal 1232 days ago

Wait, the residual stream makes the attention mechanism learn the difference from the identity! Are you sure you're not thinking about auto-encoders?

Edit: ok, Discord it is.
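
For anyone following along, here is a minimal sketch of the "difference from the identity" reading, assuming a plain residual connection of the form y = x + f(x). The names `residual_block`, `zero_sublayer` and `needed_correction` are hypothetical and not taken from the paper; `f` simply stands in for the attention sublayer.

```python
def residual_block(x, f):
    """Return x + f(x): the skip path carries x unchanged, f adds a correction."""
    return [xi + di for xi, di in zip(x, f(x))]


def zero_sublayer(x):
    """Hypothetical sublayer that has learned nothing yet: it adds zero."""
    return [0.0] * len(x)


def needed_correction(x, target):
    """What f(x) must output for the block to hit target: target - x."""
    return [ti - xi for xi, ti in zip(x, target)]


x = [1.0, 2.0, 3.0]
target = [1.5, 2.0, 2.0]

# With a zero sublayer the block is exactly the identity map.
identity_out = residual_block(x, zero_sublayer)

# To reach the target, the sublayer only has to supply the difference.
delta = needed_correction(x, target)
fitted_out = residual_block(x, lambda _: delta)
```

Under this assumption the sublayer never has to reproduce `x` itself, only the offset from it, which is the sense of "difference" used in the comment above.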

crosen99 1231 days ago

I don’t believe autodiff is finding the difference in that sense. It’s finding derivatives.

macrolocal 1231 days ago

Well, the paper uses gradient descent to minimize that difference, like auto-encoders do.

crosen99 1231 days ago

Gradient descent is just how neural networks (including auto-encoders) optimize parameters to minimize the loss function. They do this by using derivatives to descend the slope of the function. Autodiff is one way to compute the derivatives. Maybe we’re saying the same thing.
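
To make the split between the two concrete, here is a rough sketch in plain Python: a tiny forward-mode autodiff built on dual numbers computes the derivative, and a separate gradient descent loop uses it. Everything here (`Dual`, `loss`, `derivative`, `gradient_descent`, the learning rate and step count) is a hypothetical toy for illustration, not how any particular framework or the paper does it.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative (forward-mode autodiff)."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def loss(w):
    # Hypothetical toy loss with its minimum at w = 3.
    return (w - 3.0) * (w - 3.0)


def derivative(f, x):
    # Autodiff step: seed dot=1.0 so the result's dot is df/dx at x.
    return f(Dual(x, 1.0)).dot


def gradient_descent(f, w0, lr=0.1, steps=100):
    # Optimization step: repeatedly move against the derivative.
    w = w0
    for _ in range(steps):
        w -= lr * derivative(f, w)
    return w


w_star = gradient_descent(loss, w0=0.0)
```

Swapping `derivative` for a finite-difference or hand-written derivative would leave `gradient_descent` untouched, which is the sense in which autodiff is only one way to get the derivatives.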

adamnemecek 1231 days ago

Do you see a similarity between the residual stream and the Dirac delta function?

naasking 1231 days ago

Can this all be done on the GPU so the CPU doesn't need to be involved to adjust the weights?
