> Distillation isn’t an uncommon practice, but OpenAI’s terms of service prohibit customers from using the company’s model outputs to build competing AI.
I have the absolute tiniest of violins for this given OpenAI's behaviour vs everyone else's terms of service.
The default Llama 4 system prompt even instructs it to avoid using various ChatGPT-isms, presumably because they've already scraped so much GPT-generated material that it noticably skews their models output.
I recall that AI trained on AI output over many cycles eventually becomes something akin to noise texture as the output degrades rapidly.
Won’t most AI produced content put out into the public be human curated, thus heavily mitigating this degradation effect? If we’re going to see a full length AI generated movie it seems like humans will be heavily involved, hand holding the output and throwing out the AI’s nonsense.
Some will be heavily curated, by those who care about quality. This is a lot slower to produce, requires some expertise to do right, so there will be far less of it
The vast majority of content will be (is) the fastest and easiest to create - AI slop
I think there's probably a distinction to be made between deliberate, careful use of synthetic data, as opposed to blindly scraping 1PB of LLM generated SEO spam and force-feeding it into a new model. Maybe the former is useful, but the latter... probably not.
Interesting. The tonal change has definitely been noticeable. It also seems a bit more succinct and precise with its word choice, less flowery. That does seem to be in line with Gemini's behavior.
> Sam Paech, a Melbourne-based developer who creates “emotional intelligence” evaluations for AI, published what he claims is evidence that DeepSeek’s latest model was trained on outputs from Gemini. DeepSeek’s model, called R1-0528, prefers words and expressions similar to those that Google’s Gemini 2.5 Pro favors, said Paech in an X post.
And if you search for personal information of Android users, including location, sex, political orientation and location data, it is all there. /s
I have the absolute tiniest of violins for this given OpenAI's behaviour vs everyone else's terms of service.