|
|
|
|
|
by jebarker
819 days ago
|
|
This feels like a "no free lunch" situation. I would imagine that any time saving in approximating the gradients this way would be lost to needing to train for more iterations due to the loss in gradient accuracy. Is that not the case? |
|
However if this is really the biological analogue of credit assignment, this might scale better than training llms from scratch every time. Even if say it could approx gradients to a certain degree given a new network, normal backprop could further tune for a few epochs or so dramatically reducing overall training costs.