>if LLM-generated content outpaces human-generated content, the useful data proportion will diminish

I guess I’d ask why the author thinks that training LLMs on their own output will make them worse. If the problem is that LLM-generated content is less useful than human-generated content because it’s “just averaging out inputs” (a paraphrase of a common argument, not a quote from TFA), how does adding more data at the average change the distribution?

>As is now, LLMs regularly hallucinate, generate biased content or fundamentally misinterpret the task even though nothing in the wider world has been adversarial to them.

This really got me thinking about what is meant by “adversarial”. Adversarial to whom? The model itself? Its deployers? If I successfully trick ChatGPT, the system, into telling me some secrets about its inner workings, we can call that an attack on the commercial project as released by OpenAI, but can we call it an attack on the model itself?

All the text used to train LLMs is already heavily processed and filtered. I think it’s more likely that, rather than diluting the good training data, LLM-made text will simply add to the corpus. It might add a few cycles to the line-level deduplication step.
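
For what it's worth, here is a rough sketch of what I mean by a line-level deduplication pass. This is my assumption about the general technique, not any lab's actual pipeline; the function name `dedup_lines` and the normalization rule (lowercase, collapse whitespace) are made up for illustration.

```python
import hashlib


def dedup_lines(docs):
    """Drop any line whose normalized form already appeared earlier in the corpus.

    Hypothetical sketch: real pipelines may normalize differently or use
    fuzzy matching instead of exact hashes.
    """
    seen = set()  # digests of normalized lines seen so far, across all docs
    out = []
    for doc in docs:
        kept = []
        for line in doc.splitlines():
            # Assumed normalization: lowercase and collapse runs of whitespace.
            key = " ".join(line.lower().split())
            if not key:
                kept.append(line)  # keep blank lines as-is
                continue
            digest = hashlib.sha1(key.encode("utf-8")).digest()
            if digest in seen:
                continue  # duplicate line, drop it
            seen.add(digest)
            kept.append(line)
        out.append("\n".join(kept))
    return out
```

The cost is roughly one hash and one set lookup per line, so a flood of repetitive LLM-made text mostly means more lines to hash and more of them thrown away.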