Hacker News
by johnthewise 510 days ago
Emergent properties are nice. They show CoT now, but who knows whether there is a better planning strategy? The second thing is that it kind of implies any base model's capability can be increased cheaply with just some RL tuning. So in theory you can plug in any observable, quantifiable outcome beyond math and coding (stock returns, scientific experiment results?) and let it learn how to plan to solve the problem better. Train it on the observed effects of various drugs on people, and it then creates a customized treatment plan for you? An SFT version would be limited by doctors' opinions on why certain drugs affected the outcome, whereas an RL version can discover unknown relationships.
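
To make the SFT vs RL contrast concrete, here is a toy sketch, nothing like a real model or real medical data. Everything in it is an assumption for illustration: the three "treatments", their made-up success rates in `TRUE_SUCCESS`, the hypothetical `DOCTOR_CHOICE`, and the helper names. The SFT policy only imitates the doctor's label; the RL policy (plain REINFORCE with a running baseline) only sees whether the outcome was good.

```python
import math
import random

# Hypothetical toy setup: three treatments with hidden success rates.
TRUE_SUCCESS = [0.3, 0.5, 0.8]   # made-up numbers, for illustration only
DOCTOR_CHOICE = 1                # the "expert" habitually prescribes treatment 1


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_sft(steps=500, lr=0.1):
    """Imitate the doctor's label via gradient ascent on its log-likelihood."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        for a in range(3):
            target = 1.0 if a == DOCTOR_CHOICE else 0.0
            logits[a] += lr * (target - probs[a])
    return softmax(logits)


def train_rl(steps=5000, lr=0.1, seed=0):
    """REINFORCE on observed outcomes only (reward 1 if the patient improved)."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    baseline = 0.0  # running estimate of average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        reward = 1.0 if rng.random() < TRUE_SUCCESS[action] else 0.0
        advantage = reward - baseline
        baseline += 0.01 * advantage
        for a in range(3):
            indicator = 1.0 if a == action else 0.0
            logits[a] += lr * advantage * (indicator - probs[a])
    return softmax(logits)


sft_policy = train_sft()
rl_policy = train_rl()
print("SFT policy:", [round(p, 2) for p in sft_policy])
print("RL policy: ", [round(p, 2) for p in rl_policy])
```

In this toy the SFT policy can never do better than the doctor it copies, while the RL policy has no idea what the doctor thinks and just drifts toward whatever the outcomes reward. Whether that carries over to anything as messy as real treatment data is a separate question.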