Hacker News
by whimsicalism 558 days ago
Online RL for LLMs means you are sampling from the model, scoring immediately, and passing gradients back to the model.

That's as opposed to offline training, where you sample from the model a bunch, score those generations offline, and then fine-tune the model on the scored generations.
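
To make the contrast concrete, here's a rough toy sketch, not how any real LLM trainer is written. It assumes a "model" that is just four logits over four tokens, a made-up `reward` that likes token 2, and a plain REINFORCE-style update. All the names (`train_online`, `train_offline`, `policy_gradient_step`) are hypothetical. The only thing to look at is where the sampling happens relative to the update.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    # Draw one token index from the current policy.
    return rng.choices(range(len(logits)), weights=softmax(logits))[0]


def reward(token):
    # Hypothetical scorer: token 2 is the "good" generation.
    return 1.0 if token == 2 else 0.0


def policy_gradient_step(logits, token, score, lr):
    # For a softmax policy, d log p(token) / d logits = onehot(token) - probs.
    probs = softmax(logits)
    return [
        l + lr * score * ((1.0 if i == token else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def train_online(steps, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        token = sample(logits, rng)   # sample from the *current* model
        score = reward(token)         # score immediately
        logits = policy_gradient_step(logits, token, score, lr)  # update now
    return logits


def train_offline(n_samples, epochs=1, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0, 0.0]
    # Phase 1: sample a bunch from the frozen starting model, score offline.
    tokens = [sample(logits, rng) for _ in range(n_samples)]
    dataset = [(t, reward(t)) for t in tokens]
    # Phase 2: fine-tune on the fixed, already-scored generations.
    for _ in range(epochs):
        for token, score in dataset:
            logits = policy_gradient_step(logits, token, score, lr)
    return logits


online_probs = softmax(train_online(200))
offline_probs = softmax(train_offline(200))
```

In the online loop every sample comes from the model as it currently is, so the data distribution shifts as the model improves. In the offline version every sample comes from the starting model, no matter how many epochs you fine-tune for.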