| HN Mirror

When optimizing high-dimensional policies, the gap in sample complexity between PPO (and policy gradient methods in general) and ES / random search is pretty big. If you compare the Atari results from the PPO and ES papers from OpenAI, PPO after 25M frames is better than ES after 1B frames. In these two papers, the policy parametrization is roughly the same, except that ES uses virtual batchnorm. For DOTA, with a much bigger policy, I'd expect the gap between ES and PPO to be much bigger than for Atari.

My takeaway from [0] and Rajeswaran's earlier paper is that one can solve the MuJoCo tasks with linear policies after appropriate preprocessing, so we shouldn't take them too seriously. That paper doesn't do an apples-to-apples comparison between ES and PG methods on sample complexity.

All of that said, there's not enough careful analysis comparing different policy optimization methods.

(Disclaimer: I am an author of PPO)