Do you mean training each model independently and only averaging at a late stage, in order to reduce communication overhead in the distributed scenario?
I was considering this but thought perhaps that would be too resource intensive or not efficient enough.
So in this case I was thinking more about whether a few directions stood out, and, instead of potentially rejecting those, treating each one as a micro fine-tune and averaging the results at each step, à la the uniform soup.
Though perhaps a stupid idea, I'm not a practitioner.
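One thing worth checking for the per-step averaging idea: with plain SGD, a single step from a shared starting point, and one learning rate, averaging the resulting weights is algebraically the same as taking one step with the averaged gradient. The sketch below illustrates that on a toy problem. Everything in it (the linear model, the data, and the helper names `grad`, `sgd_step`, `mean`) is a hypothetical setup for illustration, not anything from the GAF or model soup code.

```python
# Hypothetical toy setup: linear model y ~ theta[0]*x + theta[1], squared error, plain SGD.

def grad(theta, batch):
    """Gradient of the mean squared error over one microbatch."""
    g0 = g1 = 0.0
    for x, y in batch:
        err = theta[0] * x + theta[1] - y
        g0 += 2 * err * x
        g1 += 2 * err
    n = len(batch)
    return [g0 / n, g1 / n]

def sgd_step(theta, g, lr):
    """One plain SGD update (no momentum, no optimizer state)."""
    return [t - lr * gi for t, gi in zip(theta, g)]

def mean(vectors):
    """Element-wise uniform average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

theta = [0.5, -0.25]   # shared starting point (assumed values)
lr = 0.1               # shared learning rate (assumed value)
microbatches = [
    [(1.0, 2.0), (2.0, 3.5)],
    [(0.0, 1.0), (3.0, 5.0)],
    [(-1.0, 0.0), (1.5, 2.5)],
]

micro_grads = [grad(theta, mb) for mb in microbatches]

# "Soup" variant: one micro fine-tune per microbatch, then average the weights.
soup = mean([sgd_step(theta, g, lr) for g in micro_grads])

# Standard variant: average the gradients, then take a single step.
standard = sgd_step(theta, mean(micro_grads), lr)

print(soup)
print(standard)
```

The two results agree up to floating-point rounding, so under these assumptions the soup only behaves differently if each micro fine-tune runs more than one step, uses a stateful optimizer, or if some directions are dropped before averaging.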
I think about it as designing your ideal solver. What do you want in a solver? I want my solver to squeeze all the juice out of the training set that it possibly can, and no more. If your problem is complete noise, I don't want my solver getting 100% train accuracy, as all SGD methods I am aware of do, and as the soup method likely would as well, as I am not averaging memorization thetas. I want my solver to score 0% train accuracy, as GAF does. There may be other ways of getting there as well.