| HN Mirror



	by ithkuil 531 days ago
	Do you mean training each model independently and only averaging at a late stage, in order to reduce communication overhead in the distributed scenario?


magicalhippo 531 days ago

I was considering this but thought perhaps that would be too resource intensive or not efficient enough.

So in this case I was thinking more along these lines: if a few directions stood out, then instead of potentially rejecting them, treat each one as a micro fine-tune and average the results at each step, à la the uniform soup.
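
For readers unfamiliar with the term, here is a minimal sketch of what uniform soup averaging could look like: an elementwise mean over the parameters of several models that share the same shapes. The function name `uniform_soup`, the dict-of-lists parameter layout, and the toy values are all assumptions made for illustration; nothing here comes from the thread or from any particular library.

```python
def uniform_soup(param_sets):
    """Return the elementwise mean of several parameter sets.

    Each parameter set is a dict mapping a (hypothetical) parameter name
    to a flat list of floats. All sets must share the same names and lengths.
    """
    if not param_sets:
        raise ValueError("need at least one parameter set")
    n = len(param_sets)
    soup = {}
    for name in param_sets[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(params[name] for params in param_sets))
        soup[name] = [sum(values) / n for values in columns]
    return soup


# Two hypothetical "micro fine-tunes" of the same tiny model.
theta_a = {"w": [1.0, 2.0], "b": [0.0]}
theta_b = {"w": [3.0, 4.0], "b": [2.0]}

soup = uniform_soup([theta_a, theta_b])
print(soup)
```

Every model gets equal weight, which is what makes the soup "uniform"; nothing is rejected, so a direction that only memorizes still contributes to the average.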

Though perhaps a stupid idea, I'm not a practitioner.


fchaubard 531 days ago

I think about designing your ideal solver. What do you want in a solver? I want my solver to squeeze all the juice out of the training set that it possibly can, and no more. If your problem is complete noise, I don’t want my solver getting 100% train accuracy, as all SGD methods I am aware of do, and as the soup method likely would as well, as I am not averaging memorization thetas. I want my solver to score 0% train accuracy, as GAF does. There may be other ways of getting there as well.
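
The thread does not spell out how GAF decides what to reject, so the following is only a sketch under a stated assumption: that two micro-batch gradients are compared by cosine distance and the step is skipped when they disagree too much. The function names, the two-gradient setup, and the `max_distance=0.9` threshold are hypothetical choices for illustration, not the author's implementation.

```python
import math


def cosine_distance(g1, g2):
    """Return 1 - cos(angle) between two gradient vectors (flat lists)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0  # assumption: a zero gradient counts as "no agreement"
    return 1.0 - dot / (norm1 * norm2)


def filtered_update(g1, g2, max_distance=0.9):
    """Average two micro-gradients only if they roughly agree.

    Returns None when the gradients disagree, meaning "skip this step".
    The threshold value is a hypothetical default.
    """
    if cosine_distance(g1, g2) > max_distance:
        return None
    return [(a + b) / 2 for a, b in zip(g1, g2)]


# Gradients pointing in nearly the same direction are averaged.
agree = filtered_update([1.0, 0.0], [1.0, 0.2])
# Orthogonal gradients (as one might expect on pure noise) are rejected.
skip = filtered_update([1.0, 0.0], [0.0, 1.0])
print(agree, skip)
```

Under this reading, a dataset of pure noise would tend to produce micro-gradients that do not agree, so most steps would be skipped and the model would not memorize, whereas a plain average would apply every step.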
