by ithkuil 531 days ago
Do you mean training each model independently and only averaging at a late stage, in order to reduce communication overhead in the distributed scenario?

I was considering this but thought perhaps that would be too resource intensive or not efficient enough.

So in this case I was thinking more of the situation where a few directions stand out: instead of potentially rejecting those, treat each one as a micro fine-tune and average the results at each step, à la the uniform soup.
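To make the idea concrete, here is a rough sketch of what I have in mind, not a tested training recipe. It assumes parameters and candidate directions are plain lists of floats; `uniform_soup` and `micro_finetune_step` are hypothetical names I made up for illustration.

```python
def uniform_soup(thetas):
    """Element-wise mean of equally weighted parameter vectors."""
    if not thetas:
        raise ValueError("need at least one parameter vector")
    n = len(thetas)
    return [sum(column) / n for column in zip(*thetas)]


def micro_finetune_step(theta, directions, lr):
    """Take one step along each candidate direction, then average the results.

    theta      -- current parameters (list of floats)
    directions -- candidate gradient directions, each the same length as theta
    lr         -- learning rate for the single micro fine-tune step
    """
    candidates = [
        [t - lr * g for t, g in zip(theta, direction)]
        for direction in directions
    ]
    return uniform_soup(candidates)
```

Note that with a single linear step per candidate, averaging the candidates is the same as stepping along the averaged direction. The two only diverge if each micro fine-tune runs for several steps before the averaging.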

Though perhaps it's a stupid idea; I'm not a practitioner.

I think about it as designing your ideal solver. What do you want in a solver? I want my solver to squeeze all the juice out of the training set that it possibly can, and no more. If your problem is complete noise, I don’t want my solver getting 100% train accuracy, as all SGD methods I am aware of do, and as the soup method likely would too, as I am not averaging memorization thetas. I want my solver to score 0% train accuracy, as GAF does. There may be other ways of getting there as well.
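For readers who want the contrast spelled out, below is a minimal sketch of an agreement filter. It assumes GAF here means gradient agreement filtering, and that agreement is measured by cosine distance between microbatch gradients. The function names and the `max_distance` threshold are hypothetical, so treat this as an illustration of the idea rather than the actual GAF implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector counts as no agreement."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def filtered_update(grads, max_distance):
    """Average only the microbatch gradients that agree with the running mean.

    Returns None when no second gradient agrees with the first, which
    the caller should treat as "skip this optimizer step".
    """
    running = list(grads[0])
    count = 1
    for g in grads[1:]:
        if cosine_distance(running, g) <= max_distance:
            count += 1
            # Incremental mean over the gradients accepted so far.
            running = [r + (x - r) / count for r, x in zip(running, g)]
    return running if count > 1 else None
```

On pure noise, microbatch gradients tend to be close to orthogonal, so a filter like this would skip most steps instead of averaging them into a memorizing update. Plain averaging, whether of gradients or of thetas, has no such veto.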