| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by macrolocal 1231 days ago
	Wait-- the residual stream makes the attention mechanism learn the difference from the identity! Are you sure you're not thinking about auto-encoders? Edit: ok, Discord it is.

2 comments

crosen99 1231 days ago

I don’t believe autodiff is finding the difference in that sense. It’s finding derivatives.

link

macrolocal 1231 days ago

Well, the paper uses gradient descent to minimize that difference, like auto-encoders do.

link

crosen99 1231 days ago

Gradient descent is just how neural networks (including auto-encoders) optimize parameters to minimize the loss function. They do this using derivatives to descend down the slope of the function. Autodiff is one way to compute the derivatives. Maybe we’re saying the same thing.

link

macrolocal 1231 days ago

Yep, I was just asking Adam* to justify his loss function.

*pun intended :)

link

adamnemecek 1231 days ago

Do you see a similarity between residual stream and Dirac function?

link