by ipython
323 days ago
So I’ve heard of this model collapse theory. But I’ve also heard of model providers who are intentionally training on synthetically generated data (because there isn’t enough “real” data). So I’m curious where the line is. Are there phases in the training / continued pretraining / alignment / RLHF pipeline where synthetic data isn’t just harmless but actually beneficial? Is it a question of quantity, or of how much novelty is in the training data?