by wegfawefgawefg 635 days ago
In reinforcement learning, pre-training reduces peak performance. We can argue about this, but on its own it is not a strong enough point to stop reading over.
2 comments

Do you have a citation for this? I did my PhD on this topic 8 years ago, and I haven't fully followed the field since. I'm curious to learn more.
Not off the top of my head, but back when self-play was first being figured out, the competing strategy was behavioural cloning, and there was some flirting with bootstrapping self-play from initial behavioural cloning. It would always bias the policy and reduce exploration, so you end up with a worse final policy. Best to train from scratch. All the top RL papers did no behavioural pre-training and beat the ones that did by many orders of magnitude on scores.

We are going to relearn this lesson with ambulation and grasping, as all the large companies are trying to build useful robots from human shadowing to avoid the gigantic sample requirements of self-play. Likely, after the initial years, compute per watt will double a couple more times, and fully self-trained models will take over those domains as the older models biased by human data get thrown out.

See this ISPD 2022 paper where the AlphaChip authors dive more into the value of pre-training (Figure 7, Figure 8): https://dl.acm.org/doi/pdf/10.1145/3505170.3511478