HN Mirror


	by brian_cloutier 429 days ago
	how so? modern post-training uses RL and immense amounts of synthetic data to iteratively bootstrap better performance. if you squint, this is extremely similar to the AlphaZero approach of iteratively training with RL on data generated through self-play