by visarga
907 days ago
It's already how we fine-tune open source LLMs. All of them live off data exfiltrated from GPT-4, and it seems to help close the gap fast. Microsoft had a whole family of papers on this idea: TinyStories, Phi-1, Phi-1.5, Phi-2...

Synthetic data has many advantages. It is free of copyright issues: the downstream models can't possibly violate copyright if they never saw the copyrighted works to begin with. It is also more diverse, and we can ensure higher average quality and less bias. It can also merge information across multiple sources.

Sometimes we can filter using feedback from code execution, simulations, preference models or humans. If you can "execute" the LLM output and get a score, you're on to a self-improving loop. LLMs can act as agents, collecting their own experiences and feedback.

I think GPTs are a ploy by OpenAI to collect synthetic data with human-in-the-loop and tools, to improve their datasets. This data would also be in-domain for users and for LLM errors: it would contain the LLM errors and the feedback. Very good data, on-policy.

My estimate: 100M users at 10K tokens per month per user is 1T synthetic tokens per month. In a year they double the size of the GPT-4 training set. And we're paying and working for it. But fortunately, 12 months after they release GPT-5 we will recover 90% of its abilities in open source models.
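To make the "execute the LLM output and get a score" idea concrete, here is a rough sketch of filtering by code execution. Everything in it is an assumption for illustration: the names `score_candidate` and `filter_by_execution`, the convention that each candidate defines a function called `solve`, and the toy test cases. The candidates are hard-coded strings standing in for model samples, and a real pipeline would run them in a sandbox rather than with a bare `exec`.

```python
def score_candidate(source, tests, func_name="solve"):
    """Run one candidate program and return the fraction of tests it passes.

    `source` is Python code expected to define `func_name` (a hypothetical
    convention for this sketch). `tests` is a list of (args, expected) pairs.
    """
    namespace = {}
    try:
        # WARNING: exec on untrusted text is unsafe; sandbox this in practice.
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        # Syntax errors or a missing function count as a score of zero.
        return 0.0

    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this test is simply a failed test
    return passed / len(tests) if tests else 0.0


def filter_by_execution(candidates, tests, threshold=1.0):
    """Keep only candidates whose execution score reaches the threshold."""
    return [c for c in candidates if score_candidate(c, tests) >= threshold]


# Stand-ins for sampled model outputs (hypothetical data).
candidates = [
    "def solve(a, b):\n    return a + b\n",   # correct
    "def solve(a, b):\n    return a - b\n",   # wrong logic
    "def solve(a, b) return a + b\n",         # does not even parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

kept = filter_by_execution(candidates, tests)
print(len(kept), "of", len(candidates), "candidates kept")
```

The kept samples are what you would feed back into the next round of fine-tuning. The same shape works with a simulator or a preference model in place of the test cases, as long as something returns a score.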
I feel like we don't know if this is true or not. If we decide models trained on copyrighted data aren't fair game, it's possible we'll decide "laundered" data also isn't.
I mean, maybe that's not feasible. And I hope we don't decide training on copyrighted material is bogus anyway. But I don't think we know yet.
But also - you can totally violate copyright of something you never saw.