
by gwern 2391 days ago

> Specifically, MuZero uses MCTS and MCTS needs to have at the very least a move generator in order to produce actions that can then be evaluated for their results.

You are confusing the two phases. The MuZero training does not use MCTS, it merely observes sequences of moves/states/rewards. This can be done using observations from anywhere: human games, AlphaGo games, AlphaZero games, random games. This is where it does the actual learning of which moves are valid and what makes moves good (because invalid moves will not be represented in the dataset of valid games). It does not need MCTS or any access to an oracle about move validity, which is Marcus's complaint. This is no more cheating than observing the real world to infer its physics.

The second phase, where new games are generated, may use MCTS. But it doesn't have to. So it can learn by simply training on a game corpus, and then generating a new game corpus by self-play using only its internal implicit tree search and a rule like illegal moves = instant loss. It will rapidly learn not to make illegal moves and to play just as validly as an MCTS-structured tree search, and then its implicit learned tree search achieves the same or greater playing strength.
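To make the "illegal moves = instant loss" rule concrete, here is a minimal sketch on tic-tac-toe. Everything in it (the `step` function, the board encoding, the reward values) is my own illustration and not anything from the MuZero paper. The point is only that the agent is never handed a list of legal moves; it sees a reward and an end-of-game flag, and an illegal move looks like any other losing move.

```python
from typing import List, Tuple

Board = List[int]  # 9 cells: 0 = empty, 1 or -1 = the two players

# All eight winning lines on a 3x3 board.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: Board) -> int:
    """Return 1 or -1 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        total = board[a] + board[b] + board[c]
        if total == 3:
            return 1
        if total == -3:
            return -1
    return 0


def step(board: Board, player: int, action: int) -> Tuple[Board, float, bool]:
    """Apply `action` for `player`; return (new_board, reward_for_player, done).

    Assumption for illustration: an illegal action ends the game with
    reward -1, exactly like a loss. No move generator is exposed.
    """
    if not 0 <= action < 9 or board[action] != 0:
        return board, -1.0, True  # illegal move = instant loss
    new_board = board.copy()
    new_board[action] = player
    if winner(new_board) == player:
        return new_board, 1.0, True
    if all(cell != 0 for cell in new_board):
        return new_board, 0.0, True  # draw
    return new_board, 0.0, False
```

The environment still knows the rules, of course; what changes is that legality reaches the learner only through the reward signal.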

1 comment

YeGoblynQueenne 2390 days ago

>> The MuZero training does not use MCTS, it merely observes sequences of moves/states/rewards.

I'm sorry, I read the paper a bit more carefully since we're discussing it and I don't think this is right. It's true that it's been a while since I read the AlphaZero paper and the details are a bit fuzzy in my memory. But in the MuZero paper it's clear that MCTS is used to generate a policy and an estimated value for the current hidden state, and to select an action to take at the current real game state (the "environment"). The observed state and reward are then later reused as past observations to train the model, together with future actions, also selected by MCTS. So it seems to me that MCTS is pretty central to the training process.
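The loop described above can be sketched roughly as follows. This is my reading of the data flow, not the paper's code: `placeholder_search` is a stand-in for MCTS over the learned model (it just returns a uniform policy and a zero value), and `toy_env_step`, `Step` and `NUM_ACTIONS` are hypothetical names for a made-up toy environment. What it shows is that each stored step carries the search outputs alongside the observation, action and reward, so the search results end up in the training data.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

NUM_ACTIONS = 2  # hypothetical toy action space


@dataclass
class Step:
    observation: int
    action: int
    reward: float
    search_policy: List[float]  # later used as the policy training target
    search_value: float         # later used when forming value targets


def placeholder_search(observation: int) -> Tuple[List[float], float]:
    """Stand-in for MCTS: returns a uniform policy and a zero value."""
    return [1.0 / NUM_ACTIONS] * NUM_ACTIONS, 0.0


def toy_env_step(observation: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical environment: a counter that terminates after 3 steps."""
    next_obs = observation + 1
    return next_obs, float(action), next_obs >= 3


def play_episode(
    search: Callable[[int], Tuple[List[float], float]],
    env_step: Callable[[int, int], Tuple[int, float, bool]],
    rng: random.Random,
) -> List[Step]:
    """Act in the real environment using the search output at every step."""
    trajectory: List[Step] = []
    obs, done = 0, False
    while not done:
        policy, value = search(obs)
        # The action actually played is sampled from the search policy.
        action = rng.choices(range(NUM_ACTIONS), weights=policy)[0]
        next_obs, reward, done = env_step(obs, action)
        trajectory.append(Step(obs, action, reward, policy, value))
        obs = next_obs
    return trajectory


# Finished episodes go into a buffer that training later samples from.
replay_buffer = [play_episode(placeholder_search, toy_env_step, random.Random(0))]
```

Swapping `placeholder_search` for a different planner would not change the rest of the loop.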

The paper does say that any MDP planning algorithm could be used in place of MCTS, but I don't think anyone seriously plans on using something other than MCTS for board games in the foreseeable future.

I'm confused by your use of the term "implicit tree search". Could you clarify?