It's not impossible; it's called inverse reinforcement learning. You learn a reward function (strictly speaking a reward rather than a value function) from external demonstrations, then use that reward to train the bot's action policy. Intuitively, the idea is to first learn which states are good and which are bad, based on the demonstrations, then use that to teach the bot how to act.
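To make the two steps concrete, here's a rough toy sketch in Python. Everything in it is made up for illustration: a 5-state chain, hand-written demonstrations, and a visit-count score standing in for the learned reward. Real IRL methods (max-entropy IRL, for example) infer the reward in a more principled way; this only shows the "score states from demos, then plan against the score" shape.

```python
# Toy 5-state chain: states 0..4, actions -1 (left) and +1 (right).
# All names and numbers here are hypothetical, chosen for illustration.
N_STATES = 5
ACTIONS = (-1, +1)
GAMMA = 0.9


def step(state, action):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + action, 0), N_STATES - 1)


def learn_reward(demos):
    """Crude stand-in for IRL: score each state by how often the demos visit it."""
    counts = [0] * N_STATES
    for traj in demos:
        for s in traj:
            counts[s] += 1
    total = sum(counts)
    return [c / total for c in counts]


def greedy_policy(reward, iters=100):
    """Value iteration on the learned reward, then the best action per state."""
    values = [0.0] * N_STATES
    for _ in range(iters):
        values = [
            max(reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ]
    return [
        max(ACTIONS, key=lambda a: reward[step(s, a)] + GAMMA * values[step(s, a)])
        for s in range(N_STATES)
    ]


# The demonstrator always heads for state 4 and stays there.
demos = [[0, 1, 2, 3, 4, 4, 4], [2, 3, 4, 4, 4], [1, 2, 3, 4, 4]]
reward = learn_reward(demos)    # step 1: which states look good?
policy = greedy_policy(reward)  # step 2: act to reach the good states
```

The bot never sees the demonstrator's actions here, only which states it spent time in; the policy comes entirely from planning against the learned scores.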

This kind of learning is similar to GANs, where the discriminator learns from real data and the generator learns from the discriminator.