by highfrequency
1 day ago
> RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

Seems clearly false. Pretraining finds the mean/mode of the data distribution. RL can easily generate many samples around that mode, evaluate them against an external source of truth (e.g. compile the code and run it), and then selectively train on the good samples. This can clearly go beyond the initial data distribution.
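
To make the loop above concrete, here is a toy sketch, not a real RLVR setup: the "policy" is just a Gaussian over numbers, the "external source of truth" is a made-up scoring function, and "training" is refitting the mean to the best samples. All names (`verifier`, `filter_and_refit`, the target of 10.0, the fixed standard deviation) are assumptions for illustration. It only shows the mechanism being argued about and says nothing about how actual language models behave.

```python
import random
import statistics


def verifier(x: float, target: float = 10.0) -> float:
    """Hypothetical external check: higher score the closer x is to target."""
    return -abs(x - target)


def filter_and_refit(mean, std, rounds=30, n_samples=200, keep_frac=0.2, seed=0):
    """Sample around the current mean, keep the best-scoring samples, refit.

    The standard deviation is held fixed (an assumption) so the toy policy
    keeps exploring around wherever its mean currently sits.
    """
    rng = random.Random(seed)
    history = [mean]
    for _ in range(rounds):
        # Generate many samples around the current mode.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Score them with the external check and keep the top fraction.
        samples.sort(key=verifier, reverse=True)
        kept = samples[: int(n_samples * keep_frac)]
        # "Train" on the good samples: move the mean to where they are.
        mean = statistics.fmean(kept)
        history.append(mean)
    return history


if __name__ == "__main__":
    history = filter_and_refit(mean=0.0, std=1.0)
    print(f"start mean: {history[0]:.2f}, final mean: {history[-1]:.2f}")
```

The starting distribution is centered at 0 with standard deviation 1, so a value near 10 is essentially never sampled from it directly. Each round only picks among samples the current policy can already produce, yet the mean drifts step by step until it sits near the target.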