
Not off the top of my head, but back when self-play was first being figured out, the competing strategy was behavioural cloning, and there was some flirting with bootstrapping self-play from an initial behaviourally cloned policy. It would always bias the policy and reduce exploration, so you end up with a worse final policy. Best to train from scratch. All the top RL papers did no behavioural pretraining and beat the ones that did by many orders of magnitude on scores.

We are going to relearn this lesson with ambulation and grasping, as all the large companies are trying to build useful robots from human shadowing to reduce the gigantic sample requirements of self-play. Likely after the initial years, computers will just get a couple more doublings in compute per watt, and we will see fully self-trained models take over those domains as the old models biased by human data get thrown out.