by magicalhippo 539 days ago
This is probably a dumb idea, but I'll air it anyway.

I just stumbled upon the model soup[1] paper where, as I understand it, they average the weights of fine-tuned models and get a model that performs better. They have a more involved algorithm, but even the uniform soup (simple weight average) seems to perform well.
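To make "uniform soup" concrete, here is a minimal sketch of a plain weight average as I understand it. The function name `uniform_soup` and the dict-of-lists model format are made up for illustration and are not taken from the paper's code.

```python
def uniform_soup(models):
    """Average parameter values across models, key by key.

    `models` is a list of dicts mapping a parameter name to a flat list
    of floats (a hypothetical stand-in for real weight tensors).
    """
    n = len(models)
    soup = {}
    for name in models[0]:
        # zip(*...) lines up the i-th value of this parameter across models.
        columns = zip(*(m[name] for m in models))
        soup[name] = [sum(col) / n for col in columns]
    return soup


# Three toy "fine-tuned models" sharing the same parameter layout.
finetuned = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
soup = uniform_soup(finetuned)
```

The only requirement in this sketch is that every model has the same parameter names and shapes.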

In your paper you mention that especially in late stages the gradients of microbatches are often not aligned, hence the agreement filtering.

What you're doing, from my brief skim, is effectively a k-means clustering pass with outlier rejection and k = 1. You then use the cluster mean to update the model.

What I'm curious about, assuming the above is correct, is what would happen if you combined the approaches.

That is, do a k-means clustering of the microbatch gradients with k > 1, still rejecting outliers but perhaps with lower threshold, generate k updated models using the k cluster means, and then average the k models afterwards.
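A rough sketch of that combined step, to show what I mean. Everything here is an assumption on my part and not the paper's algorithm: plain k-means under cosine distance seeded with the first k gradients, a hypothetical `max_dist` rejection threshold, flat Python lists instead of tensors, and a plain SGD update.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as orthogonal to everything
    return 1.0 - dot / (na * nb)


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign(grads, centroids):
    """Put each gradient in the group of its nearest centroid."""
    groups = [[] for _ in centroids]
    for g in grads:
        j = min(range(len(centroids)),
                key=lambda c: cosine_distance(g, centroids[c]))
        groups[j].append(g)
    return groups


def cluster_gradients(grads, k, iters=10):
    """Plain k-means under cosine distance, seeded with the first k gradients."""
    centroids = [list(g) for g in grads[:k]]
    for _ in range(iters):
        groups = assign(grads, centroids)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(grp) if grp else centroids[j]
                     for j, grp in enumerate(groups)]
    return centroids, assign(grads, centroids)


def clustered_soup_step(theta, grads, k=2, lr=0.1, max_dist=0.5):
    """One update: cluster, reject outliers, step once per cluster, average."""
    centroids, groups = cluster_gradients(grads, k)
    candidates = []
    for centroid, group in zip(centroids, groups):
        kept = [g for g in group if cosine_distance(g, centroid) <= max_dist]
        if not kept:
            continue  # nothing in this cluster agreed closely enough
        direction = mean(kept)
        candidates.append([t - lr * d for t, d in zip(theta, direction)])
    if not candidates:
        return list(theta)  # skip the update entirely
    return mean(candidates)  # uniform soup over the per-cluster models


theta = [0.0, 0.0]
microbatch_grads = [
    [1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0],
    [-1.0, -1.0],  # points away from both groups, so it gets rejected
]
new_theta = clustered_soup_step(theta, microbatch_grads)
```

One thing the sketch makes obvious: with a shared starting point, a shared learning rate, and a single plain SGD step, averaging the k updated models is the same as taking one step along the mean of the k cluster means. The averaging would only differ from that if each cluster's model took several steps or used its own optimizer state before being averaged.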

I've used something similar to k-means clustering with outlier rejection for noise filtering and it was quite effective, so curious how it would work out here.

[1]: https://arxiv.org/abs/2203.05482

2 comments

Other ways of combining larger numbers of microbatch gradients effectively and consistently when performing an update are one area of potential future work. I think your idea is an interesting way to approach it, though there are a bunch of potentially effective ways of doing it.

I think the idea of averaging the k models afterwards is at odds with the core concept of gradient agreement filtering, though, because you're back to combining two distinct directions of improvement without a guarantee that the combination is better (even though it does seem to be in practice). The core idea is that you philosophically only want to learn the patterns that agree across multiple specific examples, and to build some algorithmic protections to ensure that is happening. Just averaging might work and even yield improvement, but it doesn't necessarily lead to proper generalized learning.

Do you mean training each model independently and only averaging at a late stage, in order to reduce communication overhead in the distributed scenario?
I was considering this but thought perhaps that would be too resource intensive or not efficient enough.

So in this case I was thinking more of a situation where a few directions stand out: instead of potentially rejecting those, consider each a micro fine-tune and average the results at each step, à la the uniform soup.

Though perhaps a stupid idea, I'm not a practitioner.

I think about designing your ideal solver. What do you want in a solver? I want my solver to squeeze all the juice out of the train set that it possibly can and no more. If your problem is complete noise, I don't want my solver getting 100% train accuracy, as all SGD methods I am aware of do, and as the soup method likely would as well, as I am now averaging memorization thetas. I want my solver to score 0% train accuracy, as GAF does. There may be other ways of getting there as well.