by SimianSci 554 days ago
For a while there I would have agreed with you, but the idea that models can be trained purely on synthetic data has been shown to be wrong on multiple levels. Synthetic data needs to be reviewed by people to ensure data quality, which significantly reduces the speed at which an organization can adopt it as training data. Reasonable engineers would suggest having other language models review the synthetic data, but we have seen that this is what leads to model collapse, because hallucinations compound.

At best, synthetic data is a "slow follow" for training a model because of the need for human review; a competitive model, it does not make.