| HN Mirror

Training via bootstrapping (i.e. dynamic programming) does reduce the state space for search when working with function approximation. It represents a bias in what sorts of values the value function approximator should predict for each state. It encodes a sort of "local continuity/coherence" constraint that wouldn't necessarily be induced by simply training to predict the raw values -- collected stochastically via interaction with the environment. This local coherence constraint acts as a regularizer (i.e. bias) while training the value function approximator.