
In my opinion, yes, if and only if the update does not use a stateful optimiser and the computation is simple enough that the updated parameter value can be computed immediately.

In linear layers, it is possible. Once you have computed the gradient with respect to the i-th output, which is a scalar, you scale the input vector by that value and add it to that output's parameters.

This is a simple FMA op: a = fma(eta*z, x, a), with z the gradient for the i-th output (a scalar), x the input, a the parameters for that output, and eta the learning rate. This computes a = a + eta*z*x in place.
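
As a rough sketch of that update in plain Python: the function name `fused_row_update` and the example values are made up for illustration, and the loop uses an ordinary multiply-add rather than a true hardware FMA (which would round once instead of twice). The sign follows the formula above, so for descent on a loss eta (or z) would have to carry the minus sign.

```python
def fused_row_update(a, x, z, eta):
    """Update a in place: a[j] = a[j] + (eta * z) * x[j] for every j.

    a   -- parameters feeding one output (list of floats), modified in place
    x   -- input vector, same length as a
    z   -- scalar gradient for that output
    eta -- learning rate (sign convention as in the formula above)
    """
    if len(a) != len(x):
        raise ValueError("a and x must have the same length")
    s = eta * z  # scalar factor, computed once
    for j in range(len(a)):
        # Plain multiply-add; a hardware FMA would fuse this into one rounding.
        a[j] = a[j] + s * x[j]
    return a


# Hypothetical example values.
a = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
fused_row_update(a, x, z=0.5, eta=0.5)
```

No optimiser state is read or written here, which is why the update can be applied as soon as z is known.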