by YeGoblynQueenne 2396 days ago
>> The MuZero training does not use MCTS, it merely observes sequences of moves/states/rewards.

I'm sorry, I've reread the paper more carefully since we're discussing it, and I don't think this is right. It's true that it's been a while since I read the AlphaZero paper and the details are a bit fuzzy in my memory. But in the MuZero paper it's clear that MCTS is used to generate a policy and an estimated value for the current hidden state, and to select the action to take at the current real game state (the "environment"). The observed state and reward are then reused later as past observations to train the model, together with the future actions, which were also selected by MCTS. So it seems to me that MCTS is pretty central to the training process.
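To make that loop concrete, here's a rough Python sketch of how I read it. Everything in it is my own stand-in rather than anything from the paper: `toy_search` is a placeholder that just returns a random visit distribution and value where the real thing would run MCTS over the learned model, `toy_env_step` is a made-up environment, and the value target is simply the stored search value, which is a simplification. The only point is the data flow: the search picks the action and produces the policy/value that get stored, and those stored outputs become the training targets later.

```python
import random
from dataclasses import dataclass

NUM_ACTIONS = 3  # assumption: tiny toy action space


@dataclass
class Step:
    observation: int
    action: int           # action actually played, chosen from the search policy
    reward: float         # reward observed from the environment
    search_policy: list   # normalized visit counts returned by the search
    search_value: float   # root value estimated by the search


def toy_search(observation, rng, num_simulations=20):
    """Placeholder for MCTS (hypothetical): returns (policy, value)."""
    visits = [1] * NUM_ACTIONS
    for _ in range(num_simulations):
        visits[rng.randrange(NUM_ACTIONS)] += 1
    total = sum(visits)
    policy = [v / total for v in visits]
    value = rng.uniform(-1.0, 1.0)
    return policy, value


def toy_env_step(observation, action):
    """Made-up environment: reward 1 if the action matches observation mod NUM_ACTIONS."""
    reward = 1.0 if action == observation % NUM_ACTIONS else 0.0
    return observation + 1, reward


def self_play(num_steps, seed=0):
    """Acting phase: the search output drives the action and is stored."""
    rng = random.Random(seed)
    observation = 0
    trajectory = []
    for _ in range(num_steps):
        policy, value = toy_search(observation, rng)
        action = rng.choices(range(NUM_ACTIONS), weights=policy)[0]
        next_observation, reward = toy_env_step(observation, action)
        trajectory.append(Step(observation, action, reward, policy, value))
        observation = next_observation
    return trajectory


def make_targets(trajectory, start, unroll_steps):
    """Training phase: targets are read back from the stored search outputs."""
    window = trajectory[start:start + unroll_steps]
    return [(s.action, s.reward, s.search_policy, s.search_value) for s in window]


if __name__ == "__main__":
    traj = self_play(num_steps=10, seed=1)
    for target in make_targets(traj, start=2, unroll_steps=3):
        print(target)
```

Even in this toy version the training targets can't be built without first running the search, which is the sense in which I'd call it central.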

The paper does say that any MDP planning algorithm could be used in place of MCTS, but I don't think anyone seriously plans on using anything other than MCTS for board games in the foreseeable future.

I'm confused by your use of the term "implicit tree search". Could you clarify?