by musebox35
8 days ago
The most successful applications, like coding, are not the result of pure LLM/generative modeling. They come from closing the loop with an agentic harness. The generate, test, selectively refine loop is the core modality of scientific work. An LLM + RL with Verifiable Rewards + feedback from compiler/terminal runs mimics this process to a great extent. This is the Fisher/Box feedback loop (https://www-sop.inria.fr/members/Ian.Jermyn/philosophy/writi...) implemented on a modern computational system. The LLM is just one component. I wish Sutton had commented on this fuller picture of what we have now instead of just on the LLM/backprop side of things. I am honestly curious whether such a loop can at least partially automate discovery. There are more elements to discovery, though. It is still not clear where the initial working model/hypothesis comes from or how the updates are selected (unless it is just parameter induction). I recently read about Hanson's Patterns of Discovery, which aims in that direction. I have not read it yet, but I am curious whether it has any mechanistic clues.
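To make the generate, test, selectively refine loop concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: `propose` stands in for the generator (an LLM in the real setting), `score` stands in for the verifiable feedback (compiler/terminal runs), and the "program" being searched for is just a line `a*x + b`. It is not how any actual agentic harness is implemented.

```python
from typing import List, Tuple

Candidate = Tuple[int, int]  # hypothetical "program": coefficients (a, b) of a*x + b

# Verifiable test cases (input, expected output). The hidden target is 2*x + 1.
TESTS: List[Tuple[int, int]] = [(0, 1), (1, 3), (2, 5), (3, 7)]


def score(candidate: Candidate) -> int:
    """Test step: negative total error over the tests (0 means all tests pass)."""
    a, b = candidate
    return -sum(abs(a * x + b - y) for x, y in TESTS)


def propose(candidate: Candidate) -> List[Candidate]:
    """Generate step: small edits to the current best (stand-in for an LLM)."""
    a, b = candidate
    return [
        (a + da, b + db)
        for da in (-1, 0, 1)
        for db in (-1, 0, 1)
        if (da, db) != (0, 0)
    ]


def generate_test_refine(start: Candidate, max_rounds: int = 20):
    """Run the loop until all tests pass, no edit helps, or the budget runs out."""
    best, best_score = start, score(start)
    history = [(best, best_score)]
    for _ in range(max_rounds):
        if best_score == 0:  # every test passes: stop
            break
        scored = [(score(c), c) for c in propose(best)]  # generate + test
        top_score, top = max(scored)                     # select
        if top_score <= best_score:                      # no edit improves: stuck
            break
        best, best_score = top, top_score                # refine
        history.append((best, best_score))
    return best, history


best, history = generate_test_refine((0, 0))
print(best, history)
```

Note that the loop only works here because the feedback is graded (total error) rather than pass/fail, and because the starting hypothesis `(0, 0)` is handed in from outside. The sketch says nothing about where that initial hypothesis comes from, which is exactly the open question above.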
|
That is a problem in RL, so we usually do supervised training first, teaching the model to imitate some trajectories, and then do RL to refine it. RL alone has a huge problem: if the reward is hard to reach, the task is hard to learn by pure reinforcement. Humans also combine supervision (learning from books) with search (solving problems) to crack the discovery problem. For example, a human with no initial instruction in math would not produce great results, no matter how smart they are. The bootstrap is exploration that was paid for in the past.
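A back-of-the-envelope sketch of why the reward can be hard to reach, under loudly simplified assumptions that are mine and not from any real training setup: the task needs `horizon` correct steps in a row, the reward arrives only if all of them are right, and each step is correct independently with the same probability. The numbers (10 actions, 8 steps, 90% per-step accuracy after imitation) are made up for illustration.

```python
def success_probability(per_step_accuracy: float, horizon: int) -> float:
    """Chance of a fully correct episode, assuming independent steps."""
    return per_step_accuracy ** horizon


def expected_episodes_to_first_reward(p_success: float) -> float:
    """Mean of a geometric distribution: average episodes until the first reward."""
    return 1.0 / p_success


HORIZON = 8       # hypothetical number of steps needed to solve the task
NUM_ACTIONS = 10  # hypothetical number of choices per step

# Pure RL from scratch: a uniform random policy picks the right action 1 time in 10.
p_random = success_probability(1.0 / NUM_ACTIONS, HORIZON)

# After supervised imitation: assume the policy is right 90% of the time per step.
p_imitation = success_probability(0.9, HORIZON)

print("random policy:   ", p_random, expected_episodes_to_first_reward(p_random))
print("imitation policy:", p_imitation, expected_episodes_to_first_reward(p_imitation))
```

Under these assumptions the random policy needs on the order of a hundred million episodes to see a single reward, while the imitation-initialized policy sees one within a few episodes. That gap is the "exploration paid for in the past" that supervision brings along.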