Hacker News
by theGnuMe 1250 days ago
It makes sense that all gradients are local. Does it make sense to say that gradient propagation through the layers is memoryless?
1 comment

In my opinion, yes, if and only if the update does not use a stateful optimiser (one that keeps per-parameter state between steps, such as momentum) and the computation is simple enough that the updated parameter value can be computed immediately.

In linear layers, it is possible. Once you have computed the gradient with respect to the ith element of the output vector, which is a scalar, you scale the input by that value and add it to the parameters that produce that element.

This is a simple FMA op: a=fma(eta*z, x, a), with z the scalar gradient at the output, x the input, a the parameters, and eta the learning rate. This computes a = a + eta*z*x in place.
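
For what it's worth, here is a rough sketch of that update in plain Python, assuming a linear layer with no bias and weights stored as a list of rows. The names `fma_update` and `linear_backward_update` are made up for illustration. Since the update is written as an addition, I am assuming the sign is folded into eta (pass a negative eta to descend). The input gradient is accumulated from the old weights before each row is overwritten, so no gradient buffer for the weights is kept.

```python
def fma_update(a, x, z, eta):
    """In place: a[j] = (eta * z) * x[j] + a[j] for every j."""
    scale = eta * z  # scalar, computed once per output element
    for j in range(len(a)):
        a[j] = scale * x[j] + a[j]  # one multiply-add per parameter
    return a


def linear_backward_update(W, x, grad_out, eta):
    """Update each row of W as soon as its output gradient is known.

    Returns the gradient with respect to the input x, computed from the
    weights as they were before the update.
    """
    grad_in = [0.0] * len(x)
    for i, z in enumerate(grad_out):
        row = W[i]
        # Use the old weights for the input gradient first.
        for j in range(len(x)):
            grad_in[j] += z * row[j]
        # Then overwrite the row; nothing about W's gradient is stored.
        fma_update(row, x, z, eta)
    return grad_in
```

Calling `linear_backward_update(W, x, grad_out, eta=-0.1)` mutates `W` row by row and hands back the gradient to pass to the previous layer. With a stateful optimiser this would not work as is, because each parameter would need its extra state kept around between steps.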