| HN Mirror

	by sillysaurusx 2355 days ago
	It should be similarly efficient. AlphaZero used 1,000 TPUv1s to generate self-play games, and a much smaller number of TPUs to train the model on the previous self-play results. Whenever it generated a model that was >= 55% better, that became the new model. The same algorithm could be applied here.
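A rough sketch of that promotion rule, assuming "55% better" means the candidate wins at least 55% of head-to-head evaluation games against the current model. The names `should_promote` and `select_model` are made up for illustration and are not from any DeepMind code:

```python
def should_promote(candidate_wins: int, games_played: int, threshold: float = 0.55) -> bool:
    """Return True if the candidate's win rate meets the promotion threshold.

    Assumption: "better" is measured as win rate over evaluation games.
    """
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return candidate_wins / games_played >= threshold


def select_model(current, candidate, candidate_wins: int, games_played: int):
    """Keep the current model unless the candidate clears the threshold."""
    if should_promote(candidate_wins, games_played):
        return candidate
    return current
```

For example, a candidate that wins 240 of 400 evaluation games (60%) would replace the current model, while one that wins 200 of 400 (50%) would not.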

jeffshek 2355 days ago

It would not be close to similarly efficient. They have completely different loss functions.


sillysaurusx 2355 days ago

You're right, "efficient" should be substituted with "possible". We're certainly not claiming that this is a smart way to do it, just that you can.

Still, I think that there's a chance it could work well. Each move could be prefixed with the final outcome of the game, which is the technique either AlphaZero or MuZero uses.
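A minimal sketch of what outcome-prefixed training data might look like. Everything here is an assumption for illustration: chess-style result strings ("1-0", "0-1", "1/2-1/2"), the token names, and tagging each move from the mover's point of view rather than with the raw result:

```python
# Hypothetical outcome tokens; any distinct strings would do.
OUTCOME_TOKENS = {"win": "[WIN]", "loss": "[LOSS]", "draw": "[DRAW]"}
VALID_RESULTS = {"1-0", "0-1", "1/2-1/2"}


def outcome_for_mover(result: str, white_to_move: bool) -> str:
    """Translate the final game result into the mover's perspective."""
    if result == "1/2-1/2":
        return "draw"
    white_won = result == "1-0"
    return "win" if white_won == white_to_move else "loss"


def prefix_moves(moves: list[str], result: str) -> list[str]:
    """Prefix every move with the final outcome for the side that played it."""
    if result not in VALID_RESULTS:
        raise ValueError(f"unknown result: {result!r}")
    tagged = []
    for ply, move in enumerate(moves):
        white_to_move = ply % 2 == 0  # assumes White moves first
        token = OUTCOME_TOKENS[outcome_for_mover(result, white_to_move)]
        tagged.append(f"{token} {move}")
    return tagged


print(prefix_moves(["e4", "e5", "Qh5"], "1-0"))
# ['[WIN] e4', '[LOSS] e5', '[WIN] Qh5']
```

The idea would be that at sampling time you prompt with the winning token and let the model continue, so it is conditioned on moves that came from games the mover went on to win.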
