Hacker News
by svc0 974 days ago
They are most certainly being fed LLM content. However, I think this "model collapse" narrative is overblown. Here are some things to keep in mind:

(1) Real content is not generated via a synthetic loop: Humans use generative AI in complex ways, intermixing human-generated and AI-generated content. Imagine a person who writes the first draft of an essay, then uses ChatGPT to rewrite parts of it. There are certainly many human additions, modifications, and stylistic flourishes.

(2) The most dramatic effects of model collapse were seen when training multiple generations of AI agents on content generated by the previous agent. This is a very academic scenario.

(3) There is already a lot of junk consumed by these models. RLHF is aimed at eliminating these junk responses. I am not aware of any research that explores how the full training cycle is affected when RLHF is employed.

Also, there is a lot of training material out there that was not used by the original GPT-3 model. The primary limitation is hardware.

3 comments

I have come across an increasing amount of obviously generated content: recipes, product reviews, and anything Buzzfeed was known for. I only expect more and more of it. Just wait until 2024's "top 38 React server component state management libraries you need to learn this year" posts come up on dev.to.

Edit: well, look at that. I'm not saying this was generated, but it might as well have been. These "learn from these repos" posts are everywhere now.

https://dev.to/triggerdotdev/17-javascript-repositories-to-b...

It's also fairly well established now, I believe, that part of OpenAI's secret sauce is focusing on high-quality data sources; that is, probably those least likely to include unmodified ChatGPT outputs.

> This is a very academic scenario.

Is it going to remain academic? I can easily imagine the spammy content farm / listicle business model evolving to be fully automated, creating an input loop.

Sure, there will be some pollution. It's very multivariate and depends on factors like content split, generation quality, and novel information. A scenario in which all of your data is generated by the previous model and you run n training loops is academic.
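
To make the "content split" point concrete, here is a toy sketch, not a reproduction of any published experiment. It assumes the "model" is just a Gaussian fitted to its training data, and that each generation trains on a mix of fresh real samples and samples drawn from the previous generation's fit. The function name `simulate` and the parameter `real_fraction` are made up for illustration.

```python
import random
import statistics


def simulate(generations, sample_size, real_fraction, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy assumption: real data is N(0, 1) and the "model" is a Gaussian
    fitted by maximum likelihood to whatever it is trained on.
    """
    rng = random.Random(seed)
    real_mu, real_sigma = 0.0, 1.0
    mu, sigma = real_mu, real_sigma  # generation 0 matches the real data
    history = [sigma]

    n_real = round(sample_size * real_fraction)
    n_synth = sample_size - n_real

    for _ in range(generations):
        # Fresh human-written data plus output sampled from the previous model.
        data = [rng.gauss(real_mu, real_sigma) for _ in range(n_real)]
        data += [rng.gauss(mu, sigma) for _ in range(n_synth)]
        # Refit the model on the mixed corpus.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5):
        final = simulate(generations=500, sample_size=20, real_fraction=fraction)[-1]
        print(f"real_fraction={fraction}: final sigma = {final:.4g}")
```

With `real_fraction=0.0` (the fully synthetic loop) the fitted spread drifts toward zero over many generations, which is the academic scenario. Any fresh real data in the mix anchors the fit, so the outcome depends heavily on the split.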

It's also worth noting that when OpenAI created Whisper, they had to heuristically remove many transcripts from poor ASR systems, and they definitely didn't catch them all.
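
For a sense of what such heuristics can look like, here is a minimal sketch. It is not OpenAI's actual filter: the function name `looks_machine_generated` and the exact rules are assumptions, loosely based on the idea that ASR output often lacks mixed casing and punctuation.

```python
def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts that resemble raw ASR output (hypothetical rules)."""
    letters = [c for c in transcript if c.isalpha()]
    if not letters:
        return False  # nothing to judge, so do not flag it

    # Humans rarely write a whole transcript in a single case.
    all_one_case = all(c.isupper() for c in letters) or all(
        c.islower() for c in letters
    )
    # Many older ASR systems emit no punctuation at all.
    has_punctuation = any(c in ".,?!" for c in transcript)

    return all_one_case or not has_punctuation


if __name__ == "__main__":
    samples = [
        "hello world this is a test",
        "HELLO WORLD.",
        "Hello, world. This is a test.",
    ]
    for text in samples:
        print(looks_machine_generated(text), repr(text))
```

Rules like these are cheap but leaky: a good ASR system that restores casing and punctuation passes straight through, which fits the point that such filtering does not catch everything.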