| HN Mirror

	by ipython 323 days ago
	So I’ve heard of the model collapse theory. But I’ve also heard of model providers intentionally training on synthetically generated data (because there isn’t enough “real” data). So I’m curious where the line is. Are there phases in the training / continued pre-training / alignment / RLHF pipeline where synthetic data isn’t just harmless but actually beneficial? Is it a question of quantity, or of how much novelty is in the training data?