


	by lewtun 347 days ago
	Indeed, we opted for offline methods like Anchored Preference Optimization because we found in the Open R1 project that multi-task RL on small models is quite a hassle to get right. With offline methods, you focus much more on dataset curation / generation, but that still gives faster iteration cycles at the model scale we’re dealing with!