by Scene_Cast2
537 days ago
A couple of questions. First, would you happen to have a code demo? Second, and this is more of a hypothetical question for my own understanding than a practical one: in a single-GPU scenario, could you compute the loss per sample without averaging (i.e. `reduction="none"` in PyTorch) and use your algorithm to improve single-GPU training on a sample-efficiency basis? Sorry if this was already covered in the paper.
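For anyone following along, here is a minimal sketch of what "per-sample loss without averaging" could look like in PyTorch. The model, batch shape, and labels are made up for illustration, and this only shows how to get one gradient per sample on a single GPU; it is not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy setup: 4 samples, 5 features, 3 classes.
model = torch.nn.Linear(5, 3)
x = torch.randn(4, 5)
y = torch.tensor([0, 2, 1, 1])

# reduction="none" keeps one loss value per sample instead of a batch mean.
losses = F.cross_entropy(model(x), y, reduction="none")  # shape: (4,)

params = list(model.parameters())

# One backward pass per sample; retain_graph=True keeps the graph alive
# so it can be reused for the next sample.
per_sample_grads = [
    torch.autograd.grad(loss, params, retain_graph=True) for loss in losses
]

# Sanity check: averaging the per-sample grads recovers the usual batch grad.
mean_grads = torch.autograd.grad(losses.mean(), params)
```

Each entry of `per_sample_grads` is a tuple with one tensor per parameter, so any per-sample comparison or filtering would operate on those before they are averaged. Looping like this costs one backward pass per sample, so it is only practical for small batches.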
|
My loss metrics stay roughly the same. They're slightly lower, but SD loss is fraught to interpret because variance by timestep renders it more or less meaningless. Tracking the mean of `param.grad.norm() / param.numel()` (which shows how big the grad updates are), however, shows the grads stabilizing significantly quicker than baseline. I'm tracking suppressed params / total params via TensorBoard, and it drops (as expected) but then stabilizes at around 7%, suggesting that there are model parameters which consistently don't agree. I'm gonna try tracking the variance from the mean as well, and perhaps down-weight or eliminate grads for parameters which show high cosine-similarity variance over time (suggesting a generalized lack of agreement on the direction to move, further suggesting that the parameter cannot contribute meaningfully to the task).
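A rough sketch of the grad-size metric mentioned above, in case it's useful. The helper name `mean_grad_norm_per_element` and the toy model are made up for illustration; the idea is just to average `param.grad.norm() / param.numel()` over every parameter tensor that has a gradient.

```python
import torch


def mean_grad_norm_per_element(model: torch.nn.Module) -> float:
    """Mean over parameter tensors of (gradient L2 norm / element count)."""
    ratios = [
        p.grad.norm().item() / p.numel()
        for p in model.parameters()
        if p.grad is not None  # skip params that received no gradient
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical usage on a toy model, called after backward().
torch.manual_seed(0)
model = torch.nn.Linear(3, 2)
loss = model(torch.randn(8, 3)).pow(2).mean()
loss.backward()
metric = mean_grad_norm_per_element(model)
```

The resulting float can be logged each step, for example with `SummaryWriter.add_scalar` from `torch.utils.tensorboard`, to compare how quickly it settles against a baseline run.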