


	by whimsicalism 558 days ago
	Online RL for LLMs means you sample from the model, score the samples immediately, and pass gradients back to the model. That is in contrast to sampling from the model a bunch, getting scores offline, and then fine-tuning the model on those offline-scored generations.
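To make the distinction concrete, here is a toy sketch, not how any real LLM training stack does it. The "model" is just a softmax over four tokens, `reward` is a made-up scorer that likes token 2, and the names `online_rl` and `offline_finetune` are hypothetical. The update in both cases is a plain REINFORCE-style step, which is an assumption on my part; the only thing that differs is where the samples come from.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    # Draw one token index from the model's current distribution.
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


def reward(token):
    # Hypothetical scorer: token 2 is the "good" output.
    return 1.0 if token == 2 else 0.0


def grad_log_prob(logits, token):
    # Gradient of log softmax(logits)[token] w.r.t. logits: onehot - probs.
    probs = softmax(logits)
    return [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]


def online_rl(logits, steps, lr, rng):
    # Online: each sample comes from the *current* model, is scored right
    # away, and the gradient is applied before the next sample is drawn.
    logits = list(logits)
    for _ in range(steps):
        token = sample(logits, rng)
        r = reward(token)
        g = grad_log_prob(logits, token)
        logits = [w + lr * r * gi for w, gi in zip(logits, g)]
    return logits


def offline_finetune(logits, n_samples, epochs, lr, rng):
    # Offline: all samples come from the frozen starting model and are
    # scored as a batch; only afterwards do we fine-tune on that fixed set.
    scored = [(t, reward(t)) for t in (sample(logits, rng) for _ in range(n_samples))]
    logits = list(logits)
    for _ in range(epochs):
        for token, r in scored:
            g = grad_log_prob(logits, token)
            logits = [w + lr * r * gi for w, gi in zip(logits, g)]
    return logits


if __name__ == "__main__":
    rng = random.Random(0)
    start = [0.0, 0.0, 0.0, 0.0]
    print("online :", softmax(online_rl(start, 200, 0.5, rng)))
    print("offline:", softmax(offline_finetune(start, 200, 1, 0.5, rng)))
```

In the online loop the sampling distribution shifts as the model improves, so later samples reflect earlier updates. In the offline version the dataset is fixed up front, so the model keeps training on generations from its old self.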