Hacker News new | ask | show | jobs
by fxtentacle 1654 days ago
You can mathematically prove for a lot of different algorithms (including PPO, DQN, IMPALA) that given enough experience with the game world, they will eventually converge to the optimal policy. It's just that the "enough experience" part might be so large that it's practically useless.

If I remember correctly, the DeepMind x UCL RL Lecture Series proves the underlying Bellman equation in this video: https://www.youtube.com/watch?v=zSOMeug_i_M

As for "hidden information" games, I thought the trick was to concatenate the current state with all past states and treat that as the new state, thereby making it an MDP.

1 comments

I don't think you can prove that (forgive me if I don't sit through a 2h video). Those all are susceptible to the deadly triad, and AFAIK there are no convergence proofs of any kind for the big model-free DL algs, and it would've been big news if someone had proved that a real-world version of PPO/DQN/IMPALA does in fact converge in the limit. Sutton's book and earlier proofs only cover cases where you drop the nonlinear approximator or something.

(History stacking may turn POMDPs into MDPs, but I don't know if they handle the specially adversarial nature of games like poker. That's quite different from stacking ALE frames.)

Standard RL algorithms will converge to optimal play versus a fixed opponent, but will not find an optimal policy via self play.

One intuitive way to see this is that a sequence of improving pure policies A < B < C < etc. will converge to optimal play in a perfect information game like chess, but not necessarily in an imperfect information game like rock/paper/scissors where Rock < Paper < Scissors < Rock, etc