Hacker News new | ask | show | jobs
by praptak 384 days ago
"pick the path you sampled most" is misleading.

What you actually do is model every node (a game state) as a multi armed bandit. Moves are levers and the final game results are payoffs.

So you basically keep a tree of multi-armed bandits and adjust it after each (semi-)random game, perhaps adding some nodes, for example the first node the game visited which is not yet in your move tree.

For the random game you pick the next node to maximise long term payoff (exploration/exploitation tradeoff applies here) which usually means a move which gave good win ratio on previous plays but not always (exploration).

And obviously this only applies to the first part of the game which is still in the memorized tree - after that it's random.

This alone does converge to a winning strategy but sometimes impractically slowly. Here's where the neural network comes in - in every new node assign the weights not uniformly but rather directed by the NN which seeks out promising moves and greatly speeds up the convergence.