by brian_cloutier 429 days ago
how so?

modern post-training uses RL and immense amounts of synthetic data to iteratively bootstrap better performance. if you squint, this is extremely similar to the AlphaZero approach: iteratively training with RL on data the model itself generated through self-play
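
to make the shape of that loop concrete, here is a toy sketch of "generate data with the current policy, keep what scores well, train on it, repeat". everything in it is an assumption made for illustration: the four-action `policy`, the `reward` verifier that treats action 3 as correct, and the `improve` update rule are hypothetical stand-ins, not how AlphaZero or any real post-training pipeline is implemented.

```python
import random


def sample(policy, rng, n):
    """Generate n synthetic samples from the current policy (the 'self-play' step)."""
    actions = list(range(len(policy)))
    return rng.choices(actions, weights=policy, k=n)


def reward(action):
    """Hypothetical verifier: action 3 is the only 'correct' output."""
    return 1.0 if action == 3 else 0.0


def improve(policy, kept, lr=0.5):
    """Move the policy toward the empirical distribution of the kept samples."""
    if not kept:
        return policy  # nothing scored well this round, so leave the policy alone
    counts = [0] * len(policy)
    for a in kept:
        counts[a] += 1
    target = [c / len(kept) for c in counts]
    return [(1 - lr) * p + lr * t for p, t in zip(policy, target)]


def bootstrap(rounds=10, n=64, seed=0):
    """Iterate generate -> filter -> train; return P(correct) after each round."""
    rng = random.Random(seed)
    policy = [0.25, 0.25, 0.25, 0.25]  # start from a uniform policy
    history = [policy[3]]
    for _ in range(rounds):
        batch = sample(policy, rng, n)                 # data comes from the model itself
        kept = [a for a in batch if reward(a) > 0.0]   # keep only rewarded samples
        policy = improve(policy, kept)                 # train on the filtered data
        history.append(policy[3])
    return history


if __name__ == "__main__":
    for i, p in enumerate(bootstrap()):
        print(f"round {i}: P(correct) = {p:.4f}")
```

the point of the sketch is only the feedback structure: each round's training data is produced by the previous round's policy, so improvements compound without any new external data.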