by jasisz
3930 days ago
In your example script, yes, but basic UCT does not do that, simply because UCT was not originally meant only for multi-player games. It is also an assumption we make about our opponents: that each of them wants to maximize their own payout rather than, e.g., to win or to minimize our score. Of course, this is a very straightforward "application" of UCT or MCTS to multi-player games, as was done in the works of Sturtevant and Cazenave in 2008. But it is not so easy to find out about this; e.g. this is a very recent change to the Wikipedia page on the topic
https://en.wikipedia.org/w/index.php?title=Monte_Carlo_tree_... and very often people writing on the topic do not point it out at all, which I find very strange and misleading. Also, in classic MCTS you should select the move that has the most visits, not the one with the highest percentage of wins.
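To make the last point concrete, here is a minimal sketch of the two final-move rules side by side. The `ChildStats` record and the numbers in it are made up for illustration; they do not come from any particular MCTS library.

```python
from dataclasses import dataclass

@dataclass
class ChildStats:
    # Hypothetical per-move statistics kept at the root after the search.
    move: str
    visits: int
    wins: float

def most_visited(children):
    # Pick the move the search spent the most simulations on.
    return max(children, key=lambda c: c.visits).move

def highest_win_rate(children):
    # Pick the move with the best observed win percentage.
    return max(children, key=lambda c: c.wins / c.visits if c.visits else 0.0).move

# Illustrative numbers: move "b" won all 3 of its playouts but was barely explored.
children = [
    ChildStats("a", 900, 495.0),
    ChildStats("b", 3, 3.0),
    ChildStats("c", 97, 40.0),
]

print(most_visited(children))
print(highest_win_rate(children))
```

With these numbers the two rules disagree: the win-rate rule is drawn to a move whose estimate rests on only three playouts, while the visit-count rule picks the move the search actually investigated.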
|
* The choice of action to be returned is the one "with the highest average observed long-term reward"
* For simplicity, the payout value used in the paper is 1=win and 0=loss, which will result in the agents maximizing their wins. Presumably one could choose other payout values (e.g. points for that player, in games that have that concept) to adjust the priorities of the agents. The mathematics does not seem to forbid it.
* This paper uses a multi-player game as its first experiment. They state, "...for P-games UCT is modified to a negamax-style: In MIN nodes the negative of estimated action-values is used in the action selection procedures." It is straightforward, and generalizes better to games with N > 2 players, to simply record values from each player's standpoint in the first place, instead of manipulating them after the fact in this manner.
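As a rough sketch of what "record values from that player's standpoint" could look like: each node keeps one reward total per player, and selection reads the entry of whoever is choosing at the parent. The `Node` layout, the function names, and the exploration constant `c` are my own assumptions for illustration, not the paper's notation or anyone's actual script.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical node layout: player_to_move chooses among this node's children.
    player_to_move: int
    visits: int = 0
    total_reward: list = field(default_factory=list)  # one running total per player
    children: dict = field(default_factory=dict)      # move -> Node

def uct_score(parent, child, c=math.sqrt(2)):
    # Unvisited children are tried first.
    if child.visits == 0:
        return math.inf
    # Average payout seen from the standpoint of the player choosing at the parent.
    mean = child.total_reward[parent.player_to_move] / child.visits
    # UCB1-style exploration bonus; c is an assumed tuning constant.
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)

def select_child(parent):
    # Return the move whose child maximizes the score for the choosing player.
    return max(parent.children.items(), key=lambda kv: uct_score(parent, kv[1]))[0]

def backpropagate(path, rewards):
    # rewards[p] is the playout payout for player p (e.g. 1=win, 0=loss).
    for node in path:
        node.visits += 1
        if not node.total_reward:
            node.total_reward = [0.0] * len(rewards)
        for p, r in enumerate(rewards):
            node.total_reward[p] += r
```

No sign flipping is needed at "MIN" nodes: the same code serves two players or N players, and swapping win/loss for points only changes what goes into `rewards`.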
I hope this clarifies some things.