by t55
382 days ago
> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

Does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?