by whimsicalism
558 days ago
Online RL for LLMs means you sample from the model, score the samples immediately, and pass gradients back to the model. That is in contrast to sampling from the model many times, collecting the scores offline, and then fine-tuning the model on those offline-scored generations.
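To make the difference concrete, here is a toy sketch, not how any real LLM trainer is implemented. It assumes a "model" that is just a softmax over three tokens, a made-up `reward` function that prefers the token "c", and a plain REINFORCE-style update. All names (`online_rl`, `offline_finetune`, `reward`, `VOCAB`) are hypothetical, and the offline variant uses reward-weighted likelihood as one possible choice of fine-tuning objective.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    # Draw one token index from the current policy.
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


def reward(token):
    # Hypothetical scorer: only the token "c" is rewarded.
    return 1.0 if VOCAB[token] == "c" else 0.0


def grad_log_prob(logits, token):
    # Gradient of log softmax(logits)[token] w.r.t. logits: onehot - probs.
    probs = softmax(logits)
    return [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]


def online_rl(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)
    for _ in range(steps):
        token = sample(logits, rng)  # sample from the *current* model
        r = reward(token)            # score immediately
        g = grad_log_prob(logits, token)
        # Update right away, so the next sample comes from the updated model.
        logits = [l + lr * r * gi for l, gi in zip(logits, g)]
    return logits


def offline_finetune(n_samples=500, epochs=5, lr=0.1, seed=0):
    rng = random.Random(seed)
    init_logits = [0.0] * len(VOCAB)
    # 1) Sample a batch from the frozen initial model.
    data = [sample(init_logits, rng) for _ in range(n_samples)]
    # 2) Score the whole batch offline.
    scored = [(t, reward(t)) for t in data]
    # 3) Fine-tune on the fixed, scored dataset (reward-weighted likelihood).
    logits = list(init_logits)
    for _ in range(epochs):
        for token, r in scored:
            g = grad_log_prob(logits, token)
            logits = [l + lr * r * gi for l, gi in zip(logits, g)]
    return logits
```

The structural point is where the samples come from: in `online_rl` every sample is drawn from the model as it currently stands, while in `offline_finetune` all samples come from the initial model and never reflect the updates made during fine-tuning.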