| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gwern 1658 days ago
	> That is true for pretty much all current deep reinforcement learning algorithms. Is that true? I was unaware that PPO, SAC, DQN, Impala, MuZero/AlphaZero etc would all automatically Just Work™ for hidden information games. Straight MCTS-inspired algorithms seem like they'd fail for reasons discussed in the paper, and while PPO/Impala work reasonably well in DoTA2/SC2, it's not obvious they'd converge to perfect play.

1 comments

fxtentacle 1658 days ago

You can mathematically prove for a lot of different algorithms (including PPO, DQN, IMPALA) that given enough experience with the game world, they will eventually converge to the optimal policy. It's just that the "enough experience" part might be so large that it's practically useless.

If I remember correctly, the DeepMind x UCL RL Lecture Series proves the underlying Bellman equation in this video: https://www.youtube.com/watch?v=zSOMeug_i_M

As for "hidden information" games, I thought the trick was to concatenate the current state with all past states and treat that as the new state, thereby making it an MDP.

link

gwern 1658 days ago

I don't think you can prove that (forgive me if I don't sit through a 2h video). Those all are susceptible to the deadly triad, and AFAIK there are no convergence proofs of any kind for the big model-free DL algs, and it would've been big news if someone had proved that a real-world version of PPO/DQN/IMPALA does in fact converge in the limit. Sutton's book and earlier proofs only cover cases where you drop the nonlinear approximator or something.

(History stacking may turn POMDPs into MDPs, but I don't know if they handle the specially adversarial nature of games like poker. That's quite different from stacking ALE frames.)

link

cygaril 1656 days ago

Standard RL algorithms will converge to optimal play versus a fixed opponent, but will not find an optimal policy via self play.

One intuitive way to see this is that a sequence of improving pure policies A < B < C < etc. will converge to optimal play in a perfect information game like chess, but not necessarily in an imperfect information game like rock/paper/scissors where Rock < Paper < Scissors < Rock, etc

link