by kadoban 384 days ago
AlphaGo Zero in short: assume you have a neural network that, given a board position, will answer two things: what's the win probability, and how interesting is each move from here.

You use the followup moves as places to search down. It's a multi-armed bandit problem choosing which move(s) to explore down, but for simplicity in explanation you can just say: maybe just search the top few, vaguely in proportion to how interesting they are (the number the net gave you, updated if you find any surprises).
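To make "vaguely in proportion to how interesting they are" a bit more concrete, here's a rough sketch of one way to score the moves. It's a PUCT-style rule (mean result so far plus an exploration bonus weighted by the net's number); the names `select_move` and `c_explore` and the constant 1.5 are made up for illustration, and this is not meant as the exact formula from the paper.

```python
import math

def select_move(priors, visits, value_sums, c_explore=1.5):
    """Pick the next move to search down.

    priors:     move -> "how interesting" number from the net (assumed to sum to 1)
    visits:     move -> how many times we've searched down that move so far
    value_sums: move -> sum of the winrates we saw on those searches
    c_explore:  hypothetical constant trading off exploring vs. exploiting
    """
    total = sum(visits.values())

    def score(move):
        n = visits[move]
        # Mean winrate observed so far; unvisited moves start at 0.0 (an assumption).
        q = value_sums[move] / n if n else 0.0
        # Bonus is large for high-prior, rarely-visited moves and shrinks with visits.
        u = c_explore * priors[move] * math.sqrt(total + 1) / (1 + n)
        return q + u

    return max(priors, key=score)
```

With nothing searched yet this just picks the move the net likes most. Once a move has been visited a lot without looking good, its bonus shrinks and the other moves get their turn, and a move that turned out surprisingly well keeps getting picked because its mean is high.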

To search down further, you just play that move and then ask the network for the winrate (and followup moves) again. If there are any surprises, you can update upwards to say "hey this is better than expected!" or whatever.
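Here's a toy sketch of that "play the move, ask again, update upwards" loop. Everything in it is a stand-in: `evaluate` is a lookup table pretending to be the network, `play` just appends the move to a string, and the three-position "game" is invented so that the net's favourite move ("a") turns out to be the bad one. Winrates are from the point of view of whoever is to move, so they get flipped on the way back up.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # "how interesting" number from the net
        self.visits = 0
        self.value_sum = 0.0    # winrates seen, from the parent's point of view
        self.children = {}      # move -> Node

def score(parent_visits, child, c_explore=1.5):
    # Mean result so far plus a prior-weighted exploration bonus (PUCT-style).
    q = child.value_sum / child.visits if child.visits else 0.0
    u = c_explore * child.prior * math.sqrt(parent_visits + 1) / (1 + child.visits)
    return q + u

def simulate(node, position, evaluate, play):
    """One search pass; returns the winrate for the player to move at `position`."""
    if not node.children:
        # Leaf: ask the "network" and remember the followup moves it suggests.
        winrate, priors = evaluate(position)
        for move, p in priors.items():
            node.children[move] = Node(p)
        return winrate
    parent_visits = sum(c.visits for c in node.children.values())
    move, child = max(node.children.items(),
                      key=lambda mc: score(parent_visits, mc[1]))
    # Play the move and ask again; the answer is the opponent's winrate, so flip it.
    value = 1.0 - simulate(child, play(position, move), evaluate, play)
    # Update upwards: this is where a surprise changes what we search next.
    child.visits += 1
    child.value_sum += value
    return value

# Hypothetical stand-in for the network: position -> (winrate, move priors).
TABLE = {
    "":  (0.5, {"a": 0.8, "b": 0.2}),  # the net thinks "a" is more interesting
    "a": (0.9, {}),                    # but after "a" the opponent is winning
    "b": (0.1, {}),                    # and after "b" the opponent is losing
}

def evaluate(position):
    return TABLE[position]

def play(position, move):
    return position + move

root = Node(prior=1.0)
for _ in range(50):
    simulate(root, "", evaluate, play)

for move, child in root.children.items():
    print(move, child.visits, child.value_sum / max(child.visits, 1))
```

The search starts out following the net's preference for "a", finds it keeps coming back bad, and shifts most of its visits to "b". That shift is the "better than expected" update in action.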

The key thing for training this network: spending computation from an existing network gives you better training data to train that same network. So you can start from scratch and use reinforcement learning to improve it without bound.
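A very stripped-down sketch of that loop, under heavy assumptions: the "network" is just a table of move priors for a single position, `TRUE_VALUE` is an invented hidden winrate per move that stands in for actually playing the game out, and "training" is nudging the table toward where the search spent its visits. The real thing trains a neural net on full self-play games; this only shows the shape of the feedback loop.

```python
import math

# Invented ground truth the "network" never sees directly.
TRUE_VALUE = {"a": 0.2, "b": 0.8, "c": 0.5}

def search(priors, simulations=30, c_explore=1.5):
    """Spend computation: returns the fraction of visits each move received."""
    visits = {m: 0 for m in priors}
    value_sum = {m: 0.0 for m in priors}
    for _ in range(simulations):
        total = sum(visits.values())

        def score(m):
            q = value_sum[m] / visits[m] if visits[m] else 0.0
            return q + c_explore * priors[m] * math.sqrt(total + 1) / (1 + visits[m])

        move = max(priors, key=score)
        visits[move] += 1
        value_sum[move] += TRUE_VALUE[move]  # stand-in for searching deeper
    return {m: visits[m] / simulations for m in priors}

def train(priors, target, learning_rate=0.5):
    """Move the table part of the way toward the search result."""
    return {m: (1 - learning_rate) * priors[m] + learning_rate * target[m]
            for m in priors}

# Start from scratch: no idea which move is interesting.
priors = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
for generation in range(10):
    target = search(priors)          # better than the raw priors
    priors = train(priors, target)   # becomes the next generation's priors

print(priors)
```

Each generation's search output is a better opinion than the table that produced it, so feeding it back in as the training target pulls the table toward the good move without anyone ever telling it the answer.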