by foruhar
907 days ago
The future of training seems to be, at least partly, in synthetic data. I can imagine systems where a “data synthesizer” LLM is trained on open data and probably some licensed data. The synthesizer then generates data “to spec” to train larger models. MoE-type models will likely take different approaches, insofar as something like a mathematics expert can probably get a long way on training data from out-of-copyright works by Newton, Euler, et al.
|
Synthetic data has many advantages. It is free of copyright issues: the downstream models can't possibly violate copyright if they never saw the copyrighted works to begin with.
It can also be more diverse, and we can ensure higher average quality and less bias. It can merge information across multiple sources. Sometimes we can filter it using feedback from code execution, simulations, preference models or humans. If you can "execute" the LLM output and get a score, you're onto a self-improving loop. LLMs can act as agents, collecting their own experiences and feedback.
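To make the "execute the output and get a score" idea concrete, here is a minimal sketch of filtering synthetic code samples by execution feedback. Everything in it is an assumption for illustration: the candidates are hard-coded strings standing in for LLM output, the function name `solve` and the helpers `score_candidate` and `filter_by_execution` are hypothetical, and `exec` is used without any sandboxing, which a real pipeline could not get away with.

```python
def score_candidate(source, tests):
    """Execute candidate source and return the fraction of tests it passes."""
    namespace = {}
    try:
        # Assumption: each candidate defines a function named `solve`.
        # Unsandboxed exec is only acceptable for trusted toy input.
        exec(source, namespace)
        fn = namespace["solve"]
    except Exception:
        return 0.0  # did not compile, or did not define `solve`
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(tests)


def filter_by_execution(candidates, tests, threshold=1.0):
    """Keep only the candidates whose execution score reaches the threshold."""
    return [c for c in candidates if score_candidate(c, tests) >= threshold]


# Stand-ins for LLM-generated samples (hypothetical data).
candidates = [
    "def solve(a, b):\n    return a + b\n",   # correct
    "def solve(a, b):\n    return a - b\n",   # wrong logic
    "def solve(a, b) return a + b\n",         # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

kept = filter_by_execution(candidates, tests)
```

Only the first candidate survives the filter. The kept samples would go back into the training set, which is the loop part; the score could just as well come from a simulator, a preference model or a human rating.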
I think GPTs are a ploy by OpenAI to collect synthetic data with humans in the loop and tools, to improve their datasets. This data would be in-domain both for what users actually do and for where LLMs go wrong: it would contain the LLM errors and the feedback on them. Very good data, on-policy. My estimate: 100M users at 10K tokens per user per month is 1T synthetic tokens per month. In a year that's 12T tokens, enough to double the size of the GPT-4 training set. And we're paying and working for it.
But fortunately, 12 months after they release GPT-5, we will recover 90% of its abilities in open-source models.
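For anyone who wants to check the token estimate, the arithmetic is spelled out below. The user count and tokens per user are the guesses from the comment, not measured figures, and the variable names are mine.

```python
# Inputs are the comment's own guesses, not published numbers.
USERS = 100_000_000                # 100M users
TOKENS_PER_USER_PER_MONTH = 10_000 # 10K tokens per user per month

tokens_per_month = USERS * TOKENS_PER_USER_PER_MONTH
tokens_per_year = 12 * tokens_per_month

print(f"per month: {tokens_per_month / 1e12:.0f}T tokens")
print(f"per year:  {tokens_per_year / 1e12:.0f}T tokens")
```

So the "doubling in a year" claim only holds if the GPT-4 training set is somewhere around 12T tokens, which is an assumption implied by the comment rather than a confirmed number.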