by nialv7 490 days ago
I am skeptical. Intuitively, I don't see what self-play achieves beyond straight RL. Have the authors compared this against the performance they can get by RL fine-tuning a single model on its own?

Also, this style of task is prone to overfitting, i.e. instead of predicting, the model just memorises what the results are.

1 comment

Great question!

The key advantage of self-play is that we don't actually have labels for the "right" probability to assign to any given question, only binary outcomes - each event either happened (1.0) or did not happen (0.0).

Our thinking was that by generating multiple predictions and ranking them by proximity to the ground truth, self-play incentivizes each agent to produce more finely calibrated probabilities - or else the other agent might come just slightly closer to the actual outcome.
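
To make the ranking idea concrete, here is a minimal sketch, assuming each prediction is a single probability and "proximity" means absolute distance from the binary outcome. The function names and the pairwise-preference output are hypothetical illustrations, not necessarily how the actual training pipeline works.

```python
from typing import List, Tuple


def rank_by_proximity(predictions: List[float], outcome: float) -> List[int]:
    """Return prediction indices ordered from closest to farthest from the outcome.

    `outcome` is 1.0 if the event happened and 0.0 if it did not.
    Absolute distance is an assumed proximity measure.
    """
    return sorted(range(len(predictions)), key=lambda i: abs(predictions[i] - outcome))


def preference_pairs(predictions: List[float], outcome: float) -> List[Tuple[int, int]]:
    """Build (winner, loser) index pairs from the ranking.

    Exact ties are skipped, since neither prediction is closer.
    """
    order = rank_by_proximity(predictions, outcome)
    pairs = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            winner, loser = order[a], order[b]
            if abs(predictions[winner] - outcome) < abs(predictions[loser] - outcome):
                pairs.append((winner, loser))
    return pairs


# Three sampled predictions for an event that did happen (outcome = 1.0).
samples = [0.6, 0.9, 0.2]
ranking = rank_by_proximity(samples, 1.0)
pairs = preference_pairs(samples, 1.0)
```

Pairs like these could feed a preference-based objective, so a prediction only "wins" by landing closer to the realized outcome than its competitor.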