by cheald
534 days ago
As an experiment, I tried implementing this for Stable Diffusion LoRA training, where I'm training on a single GPU with a batch size of 8, and it does actually seem to have an appreciable impact. In my case, I keep a per-parameter grad EMA, compute the cosine distance between each parameter's grad and its EMA, and multiply the grad by 0 if (1.0 - cos_sim) > 0.99, i.e. when the grad is nearly orthogonal to its EMA or points against it. My loss metrics stay roughly the same (they're slightly lower, but SD loss is fraught to interpret because variance by timestep renders it more or less meaningless), but tracking the mean of `param.grad.norm / param.numel` (a rough measure of how big the grad updates are) shows the grads stabilizing significantly quicker than baseline. I'm tracking suppressed params / total params via TensorBoard, and the ratio drops (as expected) but then stabilizes at around 7%, suggesting that there are model parameters whose grads consistently disagree with their EMA. I'm gonna try tracking the variance of the cos similarity as well, and perhaps down-weight or eliminate grads for parameters which show high cos similarity variance over time (suggesting a generalized lack of agreement in the direction to move, further suggesting that the parameter cannot contribute meaningfully to the task).
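
For anyone who wants to poke at the gating rule described above, here is a minimal framework-free sketch on plain lists of floats. It is not the actual training code: the class and function names are made up for illustration, the EMA decay of 0.9 is an assumption (the comment doesn't give one), and so are the choices to let the first step pass through unfiltered and to keep updating the EMA with the raw grad even when it gets suppressed. In a real training loop the same logic would presumably run as tensor ops on each `param.grad` between the backward pass and the optimizer step.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def grad_norm_per_element(grad):
    """L2 norm divided by element count, mirroring param.grad.norm / param.numel."""
    return math.sqrt(sum(g * g for g in grad)) / len(grad)


class GradAgreementGate:
    """Zeroes a parameter's grad when it disagrees with that parameter's grad EMA."""

    def __init__(self, beta=0.9, max_cos_distance=0.99):
        self.beta = beta  # EMA decay: assumed value, not from the comment
        self.max_cos_distance = max_cos_distance  # the 0.99 threshold from the comment
        self.ema = {}  # parameter name -> EMA of its flattened grad

    def apply(self, name, grad):
        """Return (possibly zeroed grad, suppressed flag) for one parameter."""
        prev = self.ema.get(name)
        suppressed = False
        if prev is None:
            # Assumption: with no history yet, the first grad passes through.
            self.ema[name] = list(grad)
        else:
            cos_distance = 1.0 - cosine_similarity(grad, prev)
            suppressed = cos_distance > self.max_cos_distance
            # Assumption: the EMA tracks the raw grad even when it is suppressed.
            self.ema[name] = [
                self.beta * e + (1.0 - self.beta) * g for e, g in zip(prev, grad)
            ]
        out = [0.0] * len(grad) if suppressed else list(grad)
        return out, suppressed


gate = GradAgreementGate()
steps = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]  # the last grad opposes the EMA
flags = [gate.apply("w", g)[1] for g in steps]
print(flags, sum(flags) / len(flags))
```

The mean of `flags` per step is one way to get the suppressed / total ratio mentioned above, counting whole parameter tensors rather than individual elements, which is also a guess at how the ratio was computed.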