|
|
|
|
|
by jeffbradberry
3934 days ago
|
|
In what appears to be the originating paper for UCT ("Bandit based Monte-Carlo Planning", L. Kocsis and C. Szepesvári, 2006), * The choice of action to be returned is the one "with the highest average observed long-term reward" * For simplicity, the payout value used in the paper is 1=win and 0=loss, which will result in the agents maximizing their wins. Presumably one could choose other payout values (i.e. points for that player in games that have that concept) to adjust the priority of the agents. The mathematics does not seem to forbid it. * This paper uses as their first experiment a multi-player game. They state, "...for P-games UCT is modified to a negamax-style: In MIN nodes the negative of estimated action-values is used in the action selection procedures." It is straightforward and more generalizable to games with N > 2 players to simply record values from that player's standpoint in the first place, instead of manipulating it after the fact in this manner. I hope this clarifies some things. |
|
I think this is kind of a clear statement that original paper (and after it a lot of writing on the topic) may be lacking. Of course people used this simple generalization before and it is pretty straightforward, but it is not that obvious at a first glance. And I've seen quite a lot of code examples, images explaining UCT for games and articles that were just not saying a word on this. Or even worse - just doing it wrong for multiplayer games.
Choice of action is a different topic, as I remember correctly there was also a paper proving that win rate and most robust branch are in the end performing the same ;)
Hope you will continue this series, because it is really good and code examples are really nice!