|
|
|
|
|
by lalaland1125
377 days ago
|
|
This blog post is unfortunately missing what I consider the bigger reason why Q learning is not scalable: As horizon increases, the number of possible states (usually) increases exponentially. This means you require exponentially increasing data to have a hope of training a Q that can handle those states. This is less of an issue for on policy learning, because only near policy states are important, and on policy learning explicitly only samples those states. So even though there are exponential possible states your training data is laser focused on the important ones. |
|
An exponential number of states only matters if there is no pattern to them. If there is some structure that the network can learn then it can perform well. This is a strength of deep learning, not a weakness. The trick is getting the right training objective, which the article claims q learning isn't.
I do wonder if MuZero and other model based RL systems are the solution to the author's concerns. MuZero can reanalyze prior trajectories to improve training efficiency. The Monte Carlo Tree Search (MCTS) is a principled way to perform horizon reduction by unrolling the model multiple steps. The max operator in MCTS could cause similar issues but the search progressing deeper counteracts this.