by raptortech 2078 days ago
This seems like a HUGE insight! As I understand it, they show that RL can effectively be recast as two sub-problems:

1. learning a policy that imitates your own behavior on prior experience, which is a trivial supervised learning problem

2. learning how to weight the importance of prior experiences (learning a data distribution), for which the authors have derived a lower bound

Given a pool of experience, this seems like a fantastic off-policy method to optimize arbitrary reward functions. The main shortcoming I see with this method is that it still does not lead to any significant insights into how to collect new data online, which is a major open problem in RL.
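
To make the two sub-problems concrete, here is a minimal sketch in the spirit of reward-weighted imitation. It is not the authors' derivation: the exponentiated-return weighting, the `temperature` parameter, the function name `fit_weighted_policy`, and the toy data are all assumptions chosen for illustration. The policy is tabular, so the weighted supervised fit reduces to normalized weighted counts.

```python
import numpy as np

def fit_weighted_policy(states, actions, returns, n_states, n_actions, temperature=1.0):
    """Weighted maximum-likelihood fit of a tabular policy (illustrative only)."""
    returns = np.asarray(returns, dtype=float)
    # Sub-problem 2 (assumed form): weight each prior experience by its
    # exponentiated return. Subtracting the max keeps exp() numerically stable.
    weights = np.exp((returns - returns.max()) / temperature)
    # Sub-problem 1: imitate your own past actions under those weights.
    # For a tabular policy, the weighted MLE is just normalized weighted counts.
    counts = np.full((n_states, n_actions), 1e-8)  # tiny prior avoids divide-by-zero
    np.add.at(counts, (np.asarray(states), np.asarray(actions)), weights)
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical pool of experience: (state, action, observed return) triples.
states = [0, 0, 0, 1, 1]
actions = [0, 1, 1, 0, 1]
returns = [0.0, 1.0, 1.0, 2.0, 0.0]

policy = fit_weighted_policy(states, actions, returns, n_states=2, n_actions=2)
print(policy.round(3))
```

Nothing here decides which new experience to gather, which is the open problem noted above: the pool is fixed and only the weighting changes.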

1 comment

> I'm also wondering why the authors didn't publish any experiments to show that it works...

This is a blog post. It cites three of the authors' papers that each contain empirical results. The abstract of the first ends:

"We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks."
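
For readers unfamiliar with how goal reaching can be turned into supervised learning, here is a rough sketch of one common ingredient, hindsight relabeling. It assumes that states reached later in a trajectory are treated as goals the earlier actions were "trying" to reach; whether this matches the paper's exact procedure is an assumption, and `relabel_trajectory` and the toy trajectory are hypothetical names and data.

```python
import random

def relabel_trajectory(states, actions, rng=random):
    """Turn one trajectory into (state, action, goal, horizon) training examples.

    Assumes len(states) == len(actions) + 1, i.e. states[t+1] follows actions[t].
    """
    assert len(states) == len(actions) + 1
    examples = []
    for t in range(len(actions)):
        # Pick any state actually reached after step t and call it the goal.
        k = rng.randint(t + 1, len(states) - 1)  # randint is inclusive on both ends
        examples.append((states[t], actions[t], states[k], k - t))
    return examples

# Toy 1-D trajectory: the agent moves right three times.
trajectory_states = [0, 1, 2, 3]
trajectory_actions = ["right", "right", "right"]

examples = relabel_trajectory(trajectory_states, trajectory_actions, rng=random.Random(0))
for state, action, goal, horizon in examples:
    print(f"state={state} action={action} goal={goal} horizon={horizon}")
```

An iterated version would fit a goal-conditioned policy to these examples with ordinary supervised learning, collect new trajectories with it, relabel again, and repeat.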

> empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms

It's interesting that their choice of current algorithms includes PPO but not, e.g., DeepMind's Rainbow agent, which achieved state-of-the-art performance on many measures: https://arxiv.org/abs/1710.02298

They do cite Rainbow in the related work section of the third paper listed there (Kumar, A., Peng, X. B., & Levine, S. (2019). Reward-Conditioned Policies. arXiv:1912.13465), as part of this remark: "they are also known to be notoriously challenging to use effectively, due to sensitivity to hyper parameters, high sample complexity, and a range of important and delicate implementation choices that have a large effect on performance [5, 6, 12, 15, 23, 24, 46]."