HN Mirror


by kiratp 467 days ago

Unless I’m missing something, this isn’t online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.

The results of this paper suggest that doing what they did, but online, could yield better results:

https://arxiv.org/abs/2402.04792


bradhilton 467 days ago

Technically yes, only if you do a gradient step with data sampled from the exact same weights is it an online step.

With our training recipe this is easy to do: accumulate the gradients across the entire batch and take only one optimizer step before sampling more responses.
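To make "accumulate across the whole batch, then take a single optimizer step" concrete, here is a minimal sketch on a toy three-armed bandit with a softmax policy. It is not the actual training recipe: the reward values, learning rate, batch sizes, and function names are all hypothetical, and the advantage is just reward minus the batch mean, a simplification of GRPO's group-relative advantage.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(logits, rewards, batch_size, rng):
    """Sample actions from the CURRENT policy weights and score them."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=batch_size)
    return [(a, rewards[a]) for a in actions]


def policy_gradient(logits, micro_batch, baseline):
    """Sum of advantage * grad log pi(action) over one micro-batch."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in micro_batch:
        advantage = reward - baseline
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            grad[i] += advantage * (indicator - probs[i])
    return grad


def train_online(steps=200, batch_size=32, micro_batch_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards = [0.1, 0.5, 1.0]  # hypothetical reward per action
    logits = [0.0, 0.0, 0.0]   # the policy "weights"
    optimizer_steps = 0
    for _ in range(steps):
        # 1. Sample the whole batch from the current weights.
        batch = sample_batch(logits, rewards, batch_size, rng)
        baseline = sum(r for _, r in batch) / len(batch)
        # 2. Accumulate gradients over micro-batches; weights stay frozen.
        accumulated = [0.0] * len(logits)
        for start in range(0, batch_size, micro_batch_size):
            micro = batch[start:start + micro_batch_size]
            g = policy_gradient(logits, micro, baseline)
            accumulated = [a + b for a, b in zip(accumulated, g)]
        # 3. Exactly one optimizer step (plain gradient ascent), then resample.
        logits = [w + lr * g / batch_size for w, g in zip(logits, accumulated)]
        optimizer_steps += 1
    return logits, optimizer_steps


if __name__ == "__main__":
    final_logits, n_steps = train_online()
    print(n_steps, [round(p, 3) for p in softmax(final_logits)])
```

Because the weights only change in step 3, every sample that contributes to an update was drawn from exactly the weights being updated. Taking several optimizer steps per sampled batch would mean moving the weight update inside the micro-batch loop, at which point the later micro-batches are slightly off-policy.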

In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.

Ultimately, the online-ness of data is on a spectrum, and while more online data is better, other factors may be more important.


fc417fc802 467 days ago

> only if you do a gradient step with data sampled from the exact same weights is it an online step.

A bit pedantic, but an amusing thought: wouldn't that imply that asynchronous actor-critic is an offline training methodology?


bradhilton 467 days ago

Yes, pedantically, it is! But as I said, everything's on a spectrum. Online-ish data can still work just fine.
