by highfrequency
1 day ago
The model’s distribution will certainly change from the base model’s output distribution during reinforcement learning, shifting toward outputs that score well on an external evaluation. This is very different from mode-seeking. Am I missing something?

I'm not saying this makes it useless - it clearly helps for math and coding tasks. But the ceiling exists, and that's what the original tweet was referring to. AlphaEvolve also shows what lies beyond the ceiling, although its planner was rudimentary.