
> How did you get into contact with Schmidhuber for co-authoring? What stage was the research at when he joined?

I first discussed this topic with Jürgen Schmidhuber at NIPS 2016, where he gave a talk about "Learning to Think" [1]. We talked during a break at one of the sessions and kept in contact afterwards.

> Were you expecting the net to generalize from dream to reality, before you wrote the paper, or did this materialize during experimentation?

To be honest, when I first tried this I didn't expect it to work at all! And in fact, as discussed in the paper, it didn't work at first (the agent would simply learn to cheat the world model). That's why I adjusted the temperature parameter to control the stochasticity of the generated environment, and trained the agent inside a more difficult dream.
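To make the temperature idea a bit more concrete, here is a minimal sketch, assuming the world model predicts the next latent value as a 1-D mixture of Gaussians and that a temperature τ is applied by dividing the mixture logits by τ and scaling each component's standard deviation by √τ. The function name `sample_mdn` and all the numbers are hypothetical; this is an illustration of the general technique, not the code from the paper.

```python
import numpy as np

def sample_mdn(logits, means, stds, temperature, rng):
    """Draw one sample from a 1-D Gaussian mixture with a temperature knob.

    Assumption: temperature divides the mixture logits and scales each
    component's std by sqrt(temperature). Low temperature -> nearly
    deterministic "dream"; high temperature -> noisier, harder dream.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                 # numerical stability before exp
    weights = np.exp(scaled)
    weights /= weights.sum()               # softmax over mixture components
    k = rng.choice(len(weights), p=weights)  # pick a mixture component
    return means[k] + stds[k] * np.sqrt(temperature) * rng.standard_normal()

# Hypothetical two-mode prediction: a likely mode at 0, an unlikely one at 5.
rng = np.random.default_rng(0)
logits, means, stds = [2.0, 0.0], [0.0, 5.0], [1.0, 1.0]
calm = np.array([sample_mdn(logits, means, stds, 0.1, rng) for _ in range(2000)])
wild = np.array([sample_mdn(logits, means, stds, 1.5, rng) for _ in range(2000)])
print(calm.std(), wild.std())
```

At τ = 0.1 the samples collapse onto the dominant mode with small spread, while at τ = 1.5 the unlikely mode is visited much more often and the spread grows, which is one way a world model can be made harder for an agent to exploit.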

> Do you expect this approach is also feasible for more difficult games: higher dimensionality, longer delayed rewards?

I expect the iterative training approach to be promising for difficult games with higher dimensionality. There we would need better V and M (vision and memory) models with more capability and capacity (the deep learning literature already offers many candidates for V and M), which can still be trained efficiently with backprop on GPUs/TPUs. Policy search methods such as evolution (or even augmented random search) allow us to work only with the cumulative reward we see at the end of a rollout, rather than demanding a dense reward signal at every single time step, and I think this will help cope with environments with sparse, delayed rewards. Even in the experiments in this paper, we only work with the cumulative reward at the end of each rollout and ignore intermediate rewards.
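The point about cumulative rewards can be illustrated with a small sketch. This is not the setup from the paper: it is a simple elitist (1 + λ) evolution strategy on a hypothetical toy task (steer a 1-D point to a target), where the environment hands out a single score only at the end of each episode. The names `rollout`, `evolve`, and `TARGET` are made up for illustration; the optimizer only ever compares episode totals and never needs per-step rewards.

```python
import numpy as np

TARGET = 3.0  # hypothetical goal position

def rollout(params, steps=20):
    """Hypothetical toy task: push a 1-D point from x=0 toward TARGET.

    No reward is given during the episode; only one cumulative score is
    returned at the end, i.e. a maximally sparse, delayed reward.
    """
    x, v = 0.0, 0.0
    for _ in range(steps):
        obs = np.array([x - TARGET, v, 1.0])  # offset, velocity, bias term
        action = np.tanh(params @ obs)        # linear controller in [-1, 1]
        v = 0.9 * v + 0.1 * action
        x += v
    return -abs(x - TARGET)                   # the only reward signal

def evolve(pop_size=16, sigma=0.3, generations=150, seed=0):
    """Elitist (1 + lambda)-ES: keep the best controller found so far."""
    rng = np.random.default_rng(seed)
    best = np.zeros(3)
    best_score = rollout(best)
    history = [best_score]
    for _ in range(generations):
        # Sample a population of perturbed controllers around the current best.
        candidates = best + sigma * rng.standard_normal((pop_size, 3))
        scores = np.array([rollout(c) for c in candidates])
        i = int(np.argmax(scores))
        if scores[i] > best_score:  # selection uses episode totals only
            best, best_score = candidates[i], scores[i]
        history.append(best_score)
    return best, best_score, history

params, score, history = evolve()
print(f"final distance to target: {-score:.3f}")
```

Elitism keeps the best score non-decreasing, which keeps the sketch simple; population-based methods such as CMA-ES or augmented random search use the same episode-level scores but combine them into smarter updates.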

> Both congrats and thanks for writing this very accessible paper. Really found this a creative paper with a lot of inspiration, and the presentation of the results was marvelous. (BTW: I remember you from the RNN-volleyball game. Back then you had quite a few jealous detractors, telling you DeepMind would be too difficult/academic for you. You sure shut those people up!)

Thanks! The RNN-volleyball game from 2015 was a lot of fun to make. Back then, I trained the agents with evolution using self-play, and I remember people telling me I should really be using DQN or something similar. Fast forward a few years, and self-play is now a really popular area of research (for instance, there were many nice works from OpenAI and DeepMind last year), and evolution methods are really making a comeback. I think it is best to work on something you believe in; sometimes it is okay not to pursue what everyone else is doing.

[1] On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models https://arxiv.org/abs/1511.09249