by hervature 1441 days ago
This is an interesting avenue for future research. It is not as straightforward as you claim because all inference depends on your perception of the opponent's policy. That's why the Nash equilibrium is sought first: you should assume your opponent is perfect until you observe suboptimal behavior that you can exploit. You would also have to handle the meta level, making sure the exploiting part of the algorithm isn't itself being exploited by the opponent. Somehow, you should deviate slowly from the Nash equilibrium but revert quickly if the opponent is abusing your new strategy.
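To make the "deviate slowly, revert quickly" idea concrete, here is a minimal sketch. It is only an illustration and not anything from the paper: the names `mix` and `update_alpha`, the additive-increase / multiplicative-decrease rule, and the `step`, `backoff`, and `cap` values are all assumptions chosen for the example.

```python
def mix(nash, exploit, alpha):
    """Blend the Nash policy with an exploitative policy.

    alpha = 0 plays pure Nash; alpha = 1 plays the pure exploit.
    """
    return [(1 - alpha) * n + alpha * e for n, e in zip(nash, exploit)]


def update_alpha(alpha, recent_payoff, game_value=0.0,
                 step=0.05, backoff=0.5, cap=0.5):
    """Hypothetical update rule for the mixing weight.

    If we are doing worse than the game value, assume our deviation is
    being abused and shrink alpha multiplicatively (revert quickly).
    Otherwise grow alpha by a small additive step (deviate slowly),
    never exceeding cap.
    """
    if recent_payoff < game_value:
        return alpha * backoff
    return min(cap, alpha + step)


# Rock-paper-scissors: Nash is uniform; suppose the exploit is "always paper".
nash = [1 / 3, 1 / 3, 1 / 3]
exploit = [0.0, 1.0, 0.0]

alpha = 0.0
for payoff in [0.2, 0.1, 0.3, -0.4]:  # made-up average payoffs per batch
    alpha = update_alpha(alpha, payoff)
policy = mix(nash, exploit, alpha)
```

Three good batches raise `alpha` in small steps, and a single bad batch halves it. Whether a rule like this is actually safe against an adaptive opponent is exactly the open question.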

But their NN already outputs a policy conditional on public and private info! Why not have a separate intermediate branch in the NN that is fed with the current estimate of private info (for both players) and outputs the policies (again for both players) given those info estimates? Wouldn't it be possible to learn from that?
First, the neural network takes the history of observations into account. We don't know what the NN has learned, but it is probably making some inference about the likelihood of opponent piece locations. They haven't explicitly coded it to do that, but it is difficult to imagine a human-level AI not doing this.

Second, what you are suggesting is probably best done as a secondary process outside of learning the Nash equilibrium. If you knew an opponent's policy, you would need to recalculate your optimal counterplay for that specific policy. This is completely orthogonal to the goal of this paper, which is to learn the Nash equilibrium through self-play alone.
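As a toy illustration of "recalculating your optimal counterplay" against a known policy, here is a best-response computation in rock-paper-scissors. The game, the payoff matrix, and the function names are assumptions made for the example. The paper's game is far larger and has hidden information, so this only shows the shape of the computation.

```python
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[a][b] is our payoff when we play action a and the opponent plays b.
PAYOFF = [
    [0, -1, 1],   # rock
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]


def expected_values(opponent_policy):
    """Expected payoff of each of our actions against a fixed opponent policy."""
    return [
        sum(p * PAYOFF[a][b] for b, p in enumerate(opponent_policy))
        for a in range(len(ACTIONS))
    ]


def best_response(opponent_policy):
    """Return (action name, expected payoff) of a best pure response."""
    values = expected_values(opponent_policy)
    best = max(range(len(ACTIONS)), key=lambda a: values[a])
    return ACTIONS[best], values[best]


# An opponent who overplays rock is exploited by paper.
action, value = best_response([0.5, 0.25, 0.25])
```

Against the uniform (Nash) policy every action has expected value 0, so there is nothing to exploit. The best response only becomes informative once the opponent's policy is known to deviate from equilibrium.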