

by NitpickLawyer 469 days ago

Don't have examples handy, but I did a round of GRPO on a 7B model and it did indeed start to switch between English, Korean and Chinese, while the reward kept steadily increasing. RL doesn't care what the middle tokens are, as long as the end result gets the carrot.
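To make the "RL doesn't care about the middle tokens" point concrete, here is a minimal sketch of an outcome-only reward plus the group-relative advantage that GRPO-style training is built around. The function names, the "answer is the last non-empty line" convention, and the toy completions are all assumptions for illustration, not the setup from the run described above; real implementations also differ on details such as the standard-deviation estimator.

```python
from statistics import mean, pstdev

def outcome_reward(completion: str, expected: str) -> float:
    """Hypothetical reward: 1.0 if the last non-empty line equals the expected answer.

    Everything before the last line (the "middle tokens") is ignored.
    """
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return 1.0 if lines and lines[-1] == expected else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, libraries vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for the same prompt ("what is 2 + 2?").
completions = [
    "2 plus 2 is 4\n4",       # English reasoning, correct
    "2 더하기 2 는 4\n4",       # Korean reasoning, correct
    "2 加 2 等于 5\n5",         # Chinese reasoning, wrong
    "not sure\n3",            # English reasoning, wrong
]
rewards = [outcome_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)
```

The English and Korean completions get the same positive advantage, so nothing in this signal pushes the model to keep its reasoning in one language.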

I think there's still a lot to learn about reward functions (saw a team work w/ just correct output, and nothing else): whether you should reward partial success (e.g. code compiles / math outputs a result) or just the final thing (e.g. test cases pass / correct answer), and so on.
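A rough sketch of that partial-vs-final trade-off for code tasks, assuming a hypothetical `code_reward` helper where `checks` are callables that inspect the executed namespace. The 0.2 partial-credit value is arbitrary, and `exec` on model output would need a proper sandbox in any real setup.

```python
def code_reward(source: str, checks, partial_credit: float = 0.2) -> float:
    """Hypothetical shaped reward for generated Python code.

    0.0            -> does not even compile
    partial_credit -> compiles, but a check fails or raises
    1.0            -> compiles and every check passes
    Set partial_credit=0.0 to get a pure outcome-only reward.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    namespace = {}
    try:
        # WARNING: running untrusted code like this is unsafe outside a sandbox.
        exec(code, namespace)
        passed = all(check(namespace) for check in checks)
    except Exception:
        passed = False
    return 1.0 if passed else partial_credit

checks = [lambda ns: ns["add"](2, 3) == 5]
good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

scores = [code_reward(s, checks) for s in (good, wrong, broken)]
```

With partial credit the model gets some signal for merely compiling code, which can help early on but is also something it can learn to farm without ever passing the tests.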

Not to mention how to get downstream signals from e2e tasks (e.g. if an "agent" navigates to the correct webpage and finds a "cookie" or something, how do you reward all the intermediate steps out of that single binary signal).
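One standard baseline for spreading a single terminal signal over earlier steps is the discounted return-to-go. This is a textbook sketch with a made-up episode and discount factor, not a claim about what works best for agent tasks.

```python
def discounted_returns(step_rewards, gamma=0.99):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Hypothetical 4-step episode: only the last step (finding the "cookie") is rewarded.
episode = [0.0, 0.0, 0.0, 1.0]
credit = discounted_returns(episode, gamma=0.5)
# credit == [0.125, 0.25, 0.5, 1.0]
```

Every step gets some credit, decaying with distance from the success, but the scheme cannot tell a useful click from a useless one along the way, which is exactly the open problem.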

And there's a lot to learn in using grammars & stuff w/ RL as well. The problem there is that the libraries are pretty wonky atm, some things work, some things need work, and RL in itself is pretty slow due to having to generate, update the model and generate again.