| What you are talking about here is called "label bias". [2] It is present only if training is done badly. When you have a game of Go, or Super Mario level. You don't want to make your decisions by just checking the local features and doing them, because it can be the case that by compounding errors you end up in a state you never saw, and all of the future decisions won't be good. One can avoid these situations by training jointly over the whole game. For example, maximum entropy models can work for decision making problems but their training leaves them in a "label bias" state because the training is trying to minimize loss of local decisions, instead of trying to minimize the future regret of current local decision. The solution to these label bias problems are Conditional Random Fields, or Hidden Markov Models. You could accomplish the same with Recursive Neural Networks if you trained them properly. For example, there is no search part (monte carlo tree search, or dynamic programming [viterbi] like it is in CRFs or HMMs) in RNNs but they are completely adequate for decision based problems (sequence labeling etc.). Why is that the case? Because search results are present in the data, there's no need to search if you can just learn to search from the data. If DeepMind open-sourced the hundreds of millions of games that AlphaGo played, it is quite possible to train a model that wouldn't need a Monte Carlo search and would work quite well, because you would learn the model to make local decisions to minimize future regret, not to minimize its local loss. [1] The only reason why reinforcement learning is used is because there are too few human games of Go available for the model to generalize well. Reinforcement learning can be used in the setting of joint learning because you play out the whole game before you do the learning. This means that you can try to learn a classifier that will minimize the regret by making a proper local decision. Although, as far as I know, and can see from the paper, they didn't train AlphaGo jointly over the game sequence. But! Now they have a lot of data and they can repeat the process. [1]: http://arxiv.org/abs/1502.02206 [2]: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162... |
I fail to see how a "per-state normalization of transition scores" translates to there being a bias in value networks towards states with fewer outgoing transitions.