| HN Mirror

One example with model-based RL is "World Models" by Ha and Schmidhuber. They pre-train an autoencoder to reduce image observations to vectors, then pre-train a RNN to predict future reduced vectors, then use a parameter-space global optimization algorithm (but any RL algorithm would work) to train a policy that's linear in the concatenated observation vector and RNN hidden state.

The important thing here is that the image encoder and the RNN weren't trained end-to-end with the policy. The learned "features" captured enough information to be an effective policy input, even though they only needed to be useful for predicting future states.

It's also interesting that the image encoder was trained separately from the RNN. I think that only worked because the test environments were "almost" fully observable - there is world state that cannot be inferred from a single image observation, but knowing that state is not necessary for a good policy.