by raptortech
2078 days ago
This seems like a HUGE insight! As I understand it, they show that RL can effectively be recast as two sub-problems:

1. learning a policy that imitates your own behavior on prior experience, which is a trivial supervised learning problem
2. learning how to weight the importance of prior experiences (learning a data distribution), for which the authors have derived a lower bound

Given a pool of experience, this seems like a fantastic off-policy method to optimize arbitrary reward functions. The main shortcoming I see with this method is that it still does not lead to any significant insights into how to collect new data online, which is a major open problem in RL.
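To make the two sub-problems concrete, here is a rough sketch of what "weighted imitation of your own experience" could look like on a toy problem. This is not the authors' algorithm: the exponentiated-reward weighting, the `weighted_imitation` function, the `temperature` parameter, and the one-state toy data are all assumptions made for illustration, standing in for whatever weighting their lower bound actually yields.

```python
import numpy as np


def softmax_rows(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def weighted_imitation(states, actions, rewards, n_states, n_actions,
                       temperature=1.0, lr=0.5, steps=500):
    """Fit a tabular softmax policy by weighted maximum likelihood on logged data."""
    # Sub-problem 2 (assumed form): weight each logged experience.
    # Exponentiated reward is a stand-in, not the weighting from the papers.
    w = np.exp((rewards - rewards.max()) / temperature)
    w /= w.sum()

    # Sub-problem 1: supervised imitation of the logged actions,
    # with each sample's log-likelihood scaled by its weight.
    logits = np.zeros((n_states, n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(steps):
        probs = softmax_rows(logits)
        grad = np.zeros_like(logits)
        # Gradient of sum_i w_i * log pi(a_i | s_i) with respect to the logits.
        np.add.at(grad, states, w[:, None] * (onehot - probs[states]))
        logits += lr * grad
    return softmax_rows(logits)


# Toy pool of experience: one state, two actions, collected by a uniform
# random behavior policy. Action 1 earns reward 1, action 0 earns reward 0.
rng = np.random.default_rng(0)
states = np.zeros(200, dtype=int)
actions = rng.integers(0, 2, size=200)
rewards = actions.astype(float)

policy = weighted_imitation(states, actions, rewards,
                            n_states=1, n_actions=2, temperature=0.2)
print(policy)
```

The learned policy shifts most of its probability mass onto the rewarded action even though the logged data chose both actions about equally often. Note that nothing here decides which experience to collect next, which is the open problem mentioned above.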
|
This is a blog post. It cites three of the authors' papers that each contain empirical results. The abstract of the first ends:
"We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks."