|
|
|
|
|
by Scene_Cast2
384 days ago
|
|
This is one of those topics that LLMs (Opus 4, Gemini 2.5 pro, etc) seem bad at explaining. I was trying to figure out the difference between the Stockfish approach (minimax, alpha-beta pruning) versus Alpha Zero / Leela Chess Zero (MCTS). My very crude understanding is that stockfish has a very light & fast neural net and goes for a very thorough search. Meanwhile, in MCTS (which I don't really understand at this point), you eval the neural net, sample some paths based on the neural net (similar to minimax), and then pick the path you sampled the most. There's also the training vs eval aspect to it. Would love a better explanation. |
|
What you actually do is model every node (a game state) as a multi armed bandit. Moves are levers and the final game results are payoffs.
So you basically keep a tree of multi-armed bandits and adjust it after each (semi-)random game, perhaps adding some nodes, for example the first node the game visited which is not yet in your move tree.
For the random game you pick the next node to maximise long term payoff (exploration/exploitation tradeoff applies here) which usually means a move which gave good win ratio on previous plays but not always (exploration).
And obviously this only applies to the first part of the game which is still in the memorized tree - after that it's random.
This alone does converge to a winning strategy but sometimes impractically slowly. Here's where the neural network comes in - in every new node assign the weights not uniformly but rather directed by the NN which seeks out promising moves and greatly speeds up the convergence.