| This is probably a dumb idea, but I'll air it anyway. I just stumbled upon the model soup[1] paper where, as I understand it, they average the weights of fine-tuned models and get a model that performs better. They have a more involved algorithm, but even the uniform soup (a simple weight average) seems to perform well. In your paper you mention that, especially in late stages, the gradients of microbatches are often not aligned, hence the agreement filtering. What you're doing, from my brief skim, is effectively a k-means clustering pass with outlier rejection and k = 1. You then use the cluster mean to update the model. What I'm curious about, assuming the above is correct, is what would happen if you combined the approaches. That is, do a k-means clustering of the microbatch gradients with k > 1, still rejecting outliers but perhaps with a lower threshold, generate k updated models using the k cluster means, and then average the k models afterwards. I've used something similar to k-means clustering with outlier rejection for noise filtering and it was quite effective, so I'm curious how it would work out here. [1]: https://arxiv.org/abs/2203.05482 |
I think averaging the k models afterwards is at odds with the core concept of gradient agreement filtering, though, because you're back to combining two distinct directions of improvement without a guarantee that the combination is better (even though it does seem to be in practice). The core idea is that, philosophically, you only want to learn the patterns that agree across multiple specific examples, and you build some algorithmic protections to ensure that is happening. Just averaging might work and even yield an improvement, but it doesn't necessarily lead to properly generalized learning.
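To make that tension concrete, here is a rough sketch in plain Python, not code from either paper. It assumes a single plain SGD step with one shared learning rate, uses cosine distance as the agreement measure, and hand-picks the two clusters instead of running k-means. All function names are made up for illustration. Under those assumptions, averaging the k one-step models is exactly the same as taking one step along the unweighted mean of the cluster means, even when those means are orthogonal.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means aligned, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def sgd_step(theta, grad, lr):
    """One plain SGD update (assumption: no momentum, no adaptive scaling)."""
    return [t - lr * g for t, g in zip(theta, grad)]

def soup_of_cluster_steps(theta, clusters, lr):
    """Hypothetical combined scheme: one updated model per cluster mean,
    then a uniform weight average of the resulting models."""
    models = [sgd_step(theta, mean(cluster), lr) for cluster in clusters]
    return mean(models)

# Toy 2-D "microbatch gradients", grouped by hand into two clusters
# (assumption: a real version would get these groups from k-means).
theta = [1.0, 1.0]
lr = 0.5
clusters = [
    [[1.0, 0.0], [1.0, 0.2]],    # roughly along the first axis
    [[0.0, 1.0], [-0.2, 1.0]],   # roughly along the second axis
]

cluster_means = [mean(c) for c in clusters]
souped = soup_of_cluster_steps(theta, clusters, lr)
single = sgd_step(theta, mean(cluster_means), lr)
disagreement = cosine_distance(cluster_means[0], cluster_means[1])

print("soup of k models:", souped)
print("one step on mean of cluster means:", single)
print("cosine distance between cluster means:", disagreement)
```

The identity only holds for this one-step, plain-SGD case. With several steps per model, momentum, or an adaptive optimizer, the souped model and the single-step model would drift apart, so treat this as an illustration of the objection rather than a claim about how the combined scheme would behave in real training.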