by samstave 476 days ago

>>In the initial stage, we scale RL specifically for math and coding tasks. Rather than relying on traditional reward models, we utilized an accuracy verifier for math problems to ensure the correctness of final solutions and a code execution server to assess whether the generated codes successfully pass predefined test cases
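For anyone wondering what "verifier instead of a reward model" might look like in practice, here is a rough sketch. It is not from the quoted post: the function names `math_reward` and `code_reward`, the binary 0/1 reward, and the timeout are all my own assumptions, and a real code execution server would run the code in a proper sandbox rather than a bare subprocess.

```python
import subprocess
import sys
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical accuracy verifier: 1.0 if the final answer matches, else 0.0."""
    try:
        # Compare as exact rationals so "0.5" and "1/2" count as the same answer.
        return float(Fraction(model_answer.strip()) == Fraction(reference.strip()))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to a plain string comparison.
        return float(model_answer.strip() == reference.strip())


def code_reward(generated_code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Hypothetical execution check: 1.0 if the tests exit cleanly, else 0.0."""
    # Assumption: the predefined test cases are plain assert statements
    # appended after the generated code. This is NOT a sandbox.
    program = generated_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Code that hangs counts as a failure.
        return 0.0
    return float(result.returncode == 0)
```

The point of the sketch is that both rewards come from a check that either passes or fails, not from a learned model's score, which is presumably why math and code were picked for the first stage.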

They should call this the siphon/sifter model of RL.

You siphon only the initial domains, then sift to the solution...