by TemplateRex 1436 days ago
But their NN already outputs a policy conditional on public and private info! Why not have a separate intermediate branch in the NN that is fed with the current estimate of private info (for both players) and outputs the policies (again for both players) given those info estimates? Wouldn't it be possible to learn from that?

First, the neural network already takes the history of observations into account. We don't know exactly what the NN has learned, but it is probably inferring the likely locations of the opponent's pieces. They haven't explicitly coded it to do that, but it is difficult to imagine a human-level AI not doing this.
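
To make "inference on likelihood of opponent piece locations" concrete, here is a minimal sketch of what an explicit version of that inference could look like: a single Bayesian update over the identity of one hidden piece. This is only an illustration, not what the network in the paper does (it is not coded explicitly there). The piece types, the prior, and the likelihoods are all made-up assumptions.

```python
def update_belief(prior, likelihood):
    """Bayes' rule: posterior(h) is proportional to prior(h) * P(observation | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}

# Hypothetical prior over what one hidden opponent piece is.
prior = {"bomb": 0.5, "scout": 0.25, "miner": 0.25}

# Hypothetical observation: the piece just moved. We assume a bomb can
# never move, while the other two types always can.
p_moved_given_type = {"bomb": 0.0, "scout": 1.0, "miner": 1.0}

posterior = update_belief(prior, p_moved_given_type)
print(posterior)
```

Seeing the piece move rules out the immobile type and renormalizes the rest. A network trained on observation histories plausibly ends up encoding something with a similar effect in its hidden state, without it being written down anywhere.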

Second, what you are suggesting is probably best done as a secondary process, outside of learning the Nash equilibrium. If you knew an opponent's policy, you would need to recalculate your optimal counterplay for that specific policy. That is completely orthogonal to the goal of this paper, which is to learn the Nash equilibrium through self-play alone.
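
A toy example of the difference, using rock-paper-scissors rather than the game from the paper (the opponent policy below is an assumption picked for illustration). The Nash strategy is uniform random and guarantees an expected payoff of 0 against anything. Once you know the opponent over-plays rock, a separate best-response calculation does better:

```python
# Our payoff in rock-paper-scissors: rows are our action, columns the opponent's.
ACTIONS = ["rock", "paper", "scissors"]
PAYOFF = [
    [0, -1, 1],   # we play rock
    [1, 0, -1],   # we play paper
    [-1, 1, 0],   # we play scissors
]

def expected_payoffs(opponent_policy):
    """Expected payoff of each of our pure actions against a fixed opponent policy."""
    return [sum(p * u for p, u in zip(opponent_policy, row)) for row in PAYOFF]

def best_response(opponent_policy):
    """Pure action maximizing expected payoff against the known opponent policy."""
    values = expected_payoffs(opponent_policy)
    best = max(range(len(values)), key=values.__getitem__)
    return ACTIONS[best], values[best]

def value_of(policy, opponent_policy):
    """Expected payoff of our mixed policy against the opponent policy."""
    return sum(q * v for q, v in zip(policy, expected_payoffs(opponent_policy)))

nash = [1 / 3, 1 / 3, 1 / 3]   # equilibrium strategy for this game
biased = [0.5, 0.25, 0.25]     # hypothetical opponent who over-plays rock

print(value_of(nash, biased))  # the Nash strategy does not exploit the bias
print(best_response(biased))   # the best response does
```

The best response depends entirely on the opponent policy you plug in, so it has to be recomputed per opponent. The equilibrium strategy does not, which is why the two are separate problems.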