by sillysaurusx 2355 days ago
It should be similarly efficient. AlphaZero used 1,000 TPUv1s to generate self-play games, and a much smaller number of TPUs to train the model on the previous self-play results. Whenever training produced a model that won at least 55% of its games against the current one, that became the new model.

The same algorithm could be applied here.
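
To make the promotion rule above concrete, here is a rough sketch of the loop: produce a candidate, pit it against the current model, and swap it in only if it wins at least 55% of the games. Everything in it is an assumption for illustration. `win_rate`, `should_promote`, and `gated_training` are hypothetical names, and a model is reduced to a single "strength" number so the example runs without any real self-play or training.

```python
import random

def win_rate(candidate_strength, champion_strength, games, rng):
    """Return the fraction of simulated games the candidate wins."""
    # Hypothetical stand-in for real evaluation matches: the win
    # probability is a simple ratio of the two strength numbers.
    p_win = candidate_strength / (candidate_strength + champion_strength)
    wins = sum(rng.random() < p_win for _ in range(games))
    return wins / games

def should_promote(rate, threshold=0.55):
    """Promotion gate: the candidate must win at least `threshold` of games."""
    return rate >= threshold

def gated_training(rounds=20, games=400, seed=0):
    """Run the train / evaluate / promote loop and return the champion per round."""
    rng = random.Random(seed)
    champion = 1.0
    history = []
    for _ in range(rounds):
        # Stand-in for "train on the previous self-play results":
        # the candidate is a random perturbation of the champion.
        candidate = champion * rng.uniform(0.8, 1.5)
        rate = win_rate(candidate, champion, games, rng)
        if should_promote(rate):
            champion = candidate  # the candidate becomes the new model
        history.append(champion)
    return history

if __name__ == "__main__":
    print(gated_training())
```

The gate is the only part taken from the description above; how candidates are produced and how games are played would be whatever the real system does.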

1 comment

It would not be close to similarly efficient. They have completely different loss functions.
You're right; "efficient" should be replaced with "possible". We're certainly not claiming that this is a smart way to do it, just that you can.

Still, I think there's a chance it could work well. Each move could be prefixed with the final outcome of the game, which is the technique that either AlphaZero or MuZero uses.
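
As a rough illustration of the prefixing idea, something like the following would tag every move of a finished game with that game's result before it goes into the training text. The tag format, the chess-style moves, and the function names `prefix_moves_with_outcome` and `to_training_text` are all made up for this sketch, not taken from any existing codebase.

```python
def prefix_moves_with_outcome(moves, outcome):
    """Return each move prefixed with the final outcome of its game."""
    # Assumed tag format: the outcome in square brackets, then the move.
    tag = f"[{outcome}]"
    return [f"{tag} {move}" for move in moves]

def to_training_text(games):
    """Flatten (moves, outcome) pairs into one line of text per move."""
    lines = []
    for moves, outcome in games:
        lines.extend(prefix_moves_with_outcome(moves, outcome))
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical example data: two short games with known results.
    games = [
        (["e4", "e5", "Nf3"], "1-0"),
        (["d4", "d5"], "0-1"),
    ]
    print(to_training_text(games))
```

At sampling time you would then start the prompt with the outcome you want, so the model continues with moves it has seen associated with that result.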