Hacker News
by dpaleka 1221 days ago
More search won't do any good, but why wouldn't targeted training help? As I see it, the adversarial policy search discovers positions that are off-distribution relative to anything the victim saw during self-play training.

But training on that particular kind of adversarial state should help against a human player who has learned the strategy, just as training on adversarial patch examples in vision helps against the same type of patch.
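A toy sketch of that argument, under loudly hypothetical assumptions: the victim's evaluator is modelled as a lookup table that is accurate only on states it saw during training (standing in for self-play coverage) and returns a fixed heuristic guess everywhere else. The names `true_value`, `TabularEvaluator`, and `blind_spots`, and the state numbering, are made up for illustration; this is not KataGo's or AlphaGo's actual training code.

```python
# Hypothetical toy: an "evaluator" that is only right on states it trained on.

def true_value(state: int) -> int:
    """Hypothetical ground truth: +1 if the state is winning, -1 otherwise."""
    return 1 if state % 3 == 0 else -1

def heuristic_guess(state: int) -> int:
    """What the model outputs on states it never trained on (off-distribution)."""
    return 1

class TabularEvaluator:
    def __init__(self, training_states):
        # In-distribution states are learned exactly.
        self.table = {s: true_value(s) for s in training_states}

    def train_on(self, states):
        # Targeted training: add labelled adversarial states to the table.
        for s in states:
            self.table[s] = true_value(s)

    def evaluate(self, state: int) -> int:
        return self.table.get(state, heuristic_guess(state))

def blind_spots(model, states):
    """States where the model's evaluation disagrees with the ground truth."""
    return [s for s in states if model.evaluate(s) != true_value(s)]

model = TabularEvaluator(range(0, 15))  # region covered by "self-play"
human_exploit = [16, 17, 19]            # the exploit a human learned from the adversary

print("before:", blind_spots(model, human_exploit))      # → before: [16, 17, 19]
model.train_on(human_exploit)
print("after :", blind_spots(model, human_exploit))      # → after : []
print("other holes:", blind_spots(model, range(30)))     # → other holes: [20, 22, 23, 25, 26, 28, 29]
```

Training on the exploit the human learned closes exactly those holes, while off-distribution states nobody attacked stay wrong, which mirrors the limitation of patch-based adversarial training in vision.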

Of course, if the adversarial policy is again allowed to search for off-distribution states (by playing against the retrained victim), it will certainly find new ways to beat it, until the model is playing perfectly. (Emergent gradient obfuscation could also happen in theory, but I don't know whether it has actually been demonstrated.)
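The iterated version can be sketched the same way. In this hypothetical toy (again, `true_value` and `arms_race` are illustrative names, not anyone's real code), the adversary may scan the whole tiny state space for a misjudged state, the defender retrains on each exploit it finds, and the loop only stops once no blind spot remains.

```python
# Hypothetical toy arms race between an exhaustive adversary and a retraining victim.

def true_value(state: int) -> int:
    """Hypothetical ground truth: +1 if the state is winning, -1 otherwise."""
    return 1 if state % 3 == 0 else -1

def arms_race(num_states: int, self_play_states) -> list:
    """Return the exploits found, in order, before the victim has no blind spots left."""
    learned = {s: true_value(s) for s in self_play_states}
    exploits = []
    while True:
        # Adversarial search: find any state the victim evaluates wrongly.
        # Unseen states fall back to a fixed guess of +1.
        hole = next(
            (s for s in range(num_states) if learned.get(s, 1) != true_value(s)),
            None,
        )
        if hole is None:
            return exploits  # no exploitable state left: "perfect play" in this toy
        exploits.append(hole)
        learned[hole] = true_value(hole)  # targeted training on the new exploit

print(arms_race(12, range(6)))  # → [7, 8, 10, 11]
```

Termination here relies on the toy state space being small and finite; Go's state space is astronomically larger, so nothing guarantees the real loop ends in practice.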


More targeted training won't do any good, but why wouldn't more search help?

We've apparently entered the stage where who wins, man or machine, simply comes down to an arms race.

More targeted training won't do any good, but why wouldn't more search help?

My understanding is that gwern, above, pointed to solid evidence in the paper that more search isn't enough: the model's evaluation network is so far off target in these positions that realistic amounts of search don't help. Go also has a large branching factor (many legal moves per position), so the search doesn't go very deep anyway.
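A back-of-the-envelope sketch of why search stays shallow, assuming a uniform full-width tree: with branching factor b and a budget of N evaluated nodes, the reachable depth is roughly log N / log b. The branching factors (about 35 for chess, about 250 for Go) are commonly quoted rough averages, and the one-million-node budget is an arbitrary assumption.

```python
import math

def full_width_depth(branching_factor: float, node_budget: int) -> float:
    """Depth d such that branching_factor ** d == node_budget in a uniform full-width tree."""
    return math.log(node_budget) / math.log(branching_factor)

# Rough average branching factors; the node budget is a hypothetical example.
for name, b in [("chess", 35), ("Go", 250)]:
    print(name, round(full_width_depth(b, 1_000_000), 1))
# → chess 3.9
# → Go 2.5
```

AlphaGo-style MCTS is not full-width: the policy network concentrates visits on a few candidate moves, so its main lines can go deeper than this estimate. The sketch only illustrates why extra width cannot compensate for an evaluation that is badly wrong at the leaves.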

Feel free to correct me if I'm wrong; I may be misremembering how AlphaGo-style systems work.