by adamnemecek 1232 days ago
It turns out that transformers have a learning mechanism similar to autodiff, but better, since it happens mostly within individual layers rather than over the whole graph. I wrote a paper on this recently: https://arxiv.org/abs/2302.01834v1. The math is crazy.
4 comments

"Combinatorial Hopf" would make an excellent beer name!

"Bartender! A half-pint of your finest Combinatorial Hopf, if you please!"

Can you explain like I'm 5 why this matters distinctly from how transformers are normally trained with autodiff and what its possible applications are?
I’m talking about attention only transformers. Those don’t use autodiff but still learn. The math is actually really cool.
> attention only transformers

Can you share any good link on the subject?

Maybe I am missing something, but I don't see any learning without autodiff.
I thought you were asking about attention only transformers. This paper touches on some of it https://arxiv.org/abs/2212.10559v2.
First question: why should the attention mechanism output and residual stream match?
"Match" is the wrong word: they don’t match, they are duals. The residual stream, aka the identity mapping, needs to be the identity of the attention mechanism as the attention mechanism learns.

But this is the same for all residual streams, not just those in transformers.

Join my discord to discuss this further https://discord.gg/mr9TAhpyBW

Wait-- the residual stream makes the attention mechanism learn the difference from the identity! Are you sure you're not thinking about auto-encoders?

Edit: ok, Discord it is.
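
For anyone following along, here is a minimal sketch of the "learn the difference from the identity" point. It is not from either paper. The names `residual_block` and `train_branch`, the scalar branch weight `a`, and the toy target `1.5 * x` are all made up for illustration, and the gradient is derived by hand rather than by autodiff. The block computes x + a*x, so if the target is 1.5*x the branch only has to learn the leftover 0.5.

```python
def residual_block(x, a):
    # Output is the input (identity path) plus a learned branch a * x.
    return x + a * x


def train_branch(target_scale=1.5, lr=0.05, steps=500):
    # Hypothetical toy data; any nonzero inputs would do.
    xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
    a = 0.0  # branch starts at zero, so the block starts as the identity
    for _ in range(steps):
        # Hand-derived gradient of the mean squared error with respect to a.
        grad = sum(2.0 * (residual_block(x, a) - target_scale * x) * x
                   for x in xs) / len(xs)
        a -= lr * grad
    return a


a_learned = train_branch()
```

With the default target, `a_learned` ends up close to 0.5, i.e. the target minus the identity. If the target is the identity itself (`target_scale=1.0`), the branch never moves from zero.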

I don’t believe autodiff is finding the difference in that sense. It’s finding derivatives.
Well, the paper uses gradient descent to minimize that difference, like auto-encoders do.
Gradient descent is just how neural networks (including auto-encoders) optimize parameters to minimize the loss function. They do this using derivatives to descend down the slope of the function. Autodiff is one way to compute the derivatives. Maybe we’re saying the same thing.
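
To make the distinction concrete, here is a small sketch in plain Python, assuming a toy one-parameter loss. Everything in it (`Dual`, `loss`, `gradient_descent`, the minimum at 3) is hypothetical and only for illustration. The `Dual` class is a bare-bones forward-mode autodiff: it is the part that computes the derivative. The loop in `gradient_descent` is the part that uses the derivative to walk down the slope.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative (forward-mode autodiff)."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __sub__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

    __rmul__ = __mul__


def loss(w):
    # Hypothetical toy loss with its minimum at w = 3.
    diff = w - 3.0
    return diff * diff


def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        # Seed grad=1.0 so the result carries d(loss)/dw (autodiff step).
        g = loss(Dual(w, 1.0)).grad
        # Move against the gradient (gradient descent step).
        w -= lr * g
    return w


w_final = gradient_descent(0.0)
```

Starting from 0.0, `w_final` lands very close to 3. Swapping `Dual` for finite differences or a hand-written derivative would leave `gradient_descent` unchanged, which is the sense in which autodiff is just one way to get the derivatives.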
Do you see a similarity between the residual stream and the Dirac function?
Can this all be done on the GPU so the CPU doesn't need to be involved to adjust the weights?