by cygaril
1644 days ago
Standard RL algorithms will converge to optimal play versus a fixed opponent, but are not guaranteed to find an optimal policy via self-play. One intuitive way to see this: a sequence of improving pure policies A < B < C < ... will converge to optimal play in a perfect-information game like chess, but not necessarily in an imperfect-information game like rock/paper/scissors, where Rock < Paper < Scissors < Rock and so on. Each new policy beats the previous one, yet the sequence just cycles, and the actual optimum (a uniform mix of all three moves) is never reached by any pure policy in the chain.
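To make the cycle concrete, here is a minimal sketch, assuming the simplest possible form of self-play: the learner always switches to the pure best response against its previous self. The names (`PAYOFF`, `best_response`, `self_play_trajectory`, `exploitability`) are made up for illustration, and real RL algorithms update more gradually than this, so treat it as a toy rather than a description of any particular algorithm.

```python
# Toy illustration (hypothetical names): naive best-response self-play in
# rock/paper/scissors. PAYOFF[a][b] is the payoff to the player choosing a
# against an opponent choosing b: +1 win, -1 loss, 0 tie.
MOVES = ["rock", "paper", "scissors"]
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def best_response(opponent_move):
    """Return the pure move with the highest payoff against a fixed pure move."""
    return max(range(3), key=lambda a: PAYOFF[a][opponent_move])


def self_play_trajectory(start, steps):
    """Repeatedly replace the policy with the best response to its previous self."""
    trajectory = [start]
    for _ in range(steps):
        trajectory.append(best_response(trajectory[-1]))
    return trajectory


def exploitability(mixed):
    """Best expected payoff an opponent can achieve against a mixed strategy."""
    return max(
        sum(PAYOFF[a][b] * mixed[b] for b in range(3))
        for a in range(3)
    )


if __name__ == "__main__":
    # Every step strictly beats the one before it, but the sequence cycles.
    print([MOVES[m] for m in self_play_trajectory(0, 6)])
    # Any pure policy can be beaten; the uniform mix cannot.
    print(exploitability([1.0, 0.0, 0.0]))
    print(exploitability([1 / 3, 1 / 3, 1 / 3]))
```

Against a fixed opponent, `best_response` is all you need, which is the "converges versus a fixed opponent" half of the claim. The trouble in self-play is that the opponent moves every time you improve.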