|
|
|
|
|
by naturalgradient
2919 days ago
|
|
Yes, I am aware, I did not mean random search as in random actions, but random search with improved heuristics to find a policy. The point being that that the bells and whistles of PPO and other relatively complaticated algorithms (e.g. Q-PROP), namely the specific clipped objective, subsampling, and a (in my experience) very difficult to tune baseline using the same objective, do not significantly improve over gradient descent. And I think Ben Recht's arguments [0] expands on that a bit in terms of what we are actually doing with policy gradient (not using a likelihood ratio model like in PPO) but still conceptually similar enough for the argument to hold. So I think it comes down to two questions: How much do 'modern' policy gradient models improve on REINFORCE, and how much better is REINFORCE really than random search? The answer thus far seemed to be: not that much better, and I am trying to get a sense of if this was a wrong intuition. [0] http://www.argmin.net/2018/02/20/reinforce/ |
|
My takeaway from [0] and Rajeswaran's earlier paper is that one can solve the MuJoCo tasks with linear policies after appropriate preprocessing, so we shouldn't take them too seriously. That paper doesn't do an apples-to-apples comparison between ES and PG methods on sample complexity.
All of that said, there's not enough careful analysis comparing different policy optimization methods.
(Disclaimer: I am an author of PPO)