by MacsHeadroom 830 days ago
This is exactly right. Model collapse does not exist in practice. In fact, LLMs trained on newer web scrapes have increased capabilities thanks to the generated output in their training data.

For example, "base" pretrained models trained on scrapes which include generated outputs can 0-shot instruction follow and score higher on reasoning benchmarks.

Intentionally produced synthetic training data takes this a step further. For SoTA LLMs the majority of, or all of, their training data is generated. Phi-2 and Claude 3 for example.

5 comments

Ironically Claude 3 appears to have certain "quirks" arguably caused by the fact that its training data contains synthetic data. In one instance (https://twitter.com/DimitrisPapail/status/176477229891207585...), it kept referring to itself as ChatGPT.

Granted, one could argue that this only happened because the API version of Claude doesn't appear to use a system prompt. If that's the case, then the LLM lacks any identity otherwise defined by the initial system prompt, and thus, kind of makes one up.

Nonetheless, point remains, it's kind of interesting to see that in the years since the launch of ChatGPT we're already seeing a tangible impact on publicly available training data. LLMs "know" what ChatGPT is, and may even claim to be it.

that is the meat the article tries to cook. the impacts so far aren’t all that negative.

but time flows like a river, and the more shit that gets into it…

poison does not need to be immediately fatal to be fatal. some take a frighteningly long time to work. by the time you know what’s happening, not only is it too late, you have already suffered too much.

does this sound like anything more than a scary story to tell around campfires? not yet.

Claude 3 does use publicly available data. Not everything is synthetically generated. Look at the training data section in the link below. It has a quote from the paper stating that it uses a mix of public data, data from labelers, and synthetic data.

https://www.lesswrong.com/posts/JbE7KynwshwkXPJAJ/anthropic-...

I can't find a link to the actual Claude paper to verify the above link, but a few other places mention the same thing about the training data. We don't know if this improved performance is because of synthetic data or something else. I'm guessing even Anthropic might not know either.

Wouldn’t reinforcement learning just weigh any nonsense data very low, so that spammy garbage doesn’t really affect the model much in the end? If the model and human experts can’t tell the difference, then it’s probably pretty good AI-generated data.

Truth and what humans think is true are different things. Synthetic data was created by models that were trained to be convincing.

the ideal poison tastes like nothing, or at the very least doesn’t taste bad.

you wouldn’t want to alert the victim.

What happens if you train a model on nothing but AI-generated output, recursively? Does it eventually get inbred?

Why would you limit a model to be like a brain in a vat? Instead let the model out so people use it, then use the chat logs to fine-tune. A chat room is a kind of environment, there is a human, maybe some tools. The LLM text will generate feedback and right there is a learning signal.

Even without a human, if an LLM has access to code execution it can practice solving coding tasks with runtime feedback. There are many ways an LLM could obtain useful learning signals. After all, we got all our knowledge from the environment as well; in the end there is no other source for knowledge and skills.
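
A minimal sketch of what "runtime feedback" could look like as a learning signal, assuming a toy task (implement `add`) and hand-written stand-ins for model samples. The names `runtime_reward`, `TEST_CASES`, and `candidates` are made up for illustration, and this is not how any particular lab does it:

```python
from typing import Dict, List, Tuple

# Hypothetical task: implement add(a, b). Each case is (arguments, expected result).
TEST_CASES: List[Tuple[Tuple[int, int], int]] = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]


def runtime_reward(candidate_source: str, func_name: str = "add") -> float:
    """Execute candidate code and return the fraction of test cases it passes."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real setup would need a sandbox.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that fails to load earns no reward
    passed = 0
    for args, expected in TEST_CASES:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed case
    return passed / len(TEST_CASES)


# Stand-ins for samples drawn from a model (assumed, not real model output).
candidates: Dict[str, str] = {
    "correct": "def add(a, b):\n    return a + b\n",
    "buggy": "def add(a, b):\n    return a - b\n",
    "broken": "def add(a, b) return a + b\n",
}

rewards = {name: runtime_reward(src) for name, src in candidates.items()}
print(rewards)
```

The score comes from the interpreter and the tests rather than from a human label, so in principle it could be used to rank or filter samples before fine-tuning.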

I want to observe that one of my favorite youtubers did exactly this with making the "uppest case" and "lowest case" letters.

https://www.youtube.com/watch?v=HLRdruqQfRk

I love this guy so much and wish he made far more videos.

Depends how good the AI output is, just like it depends how good the natural output is.

If most of it is bad but you can get a better AI to tag it as bad, then it's not necessarily a problem.

Without human input, yes.
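
A toy way to poke at the recursive question, assuming a one-dimensional Gaussian stands in for the "model" (a big simplification, nothing like an LLM). Each generation is fit only to samples from the previous fit. The sample size, generation count, and seed are arbitrary choices:

```python
import random
import statistics
from typing import List


def recursive_fit(generations: int = 300, n_samples: int = 10, seed: int = 0) -> List[float]:
    """Refit a Gaussian to its own samples each generation; return the std per generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    stds = [sigma]
    for _ in range(generations):
        # "Train" on nothing but the previous generation's output.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate of the spread
        stds.append(sigma)
    return stds


stds = recursive_fit()
print(f"std at generation 0: {stds[0]:.3f}, at generation {len(stds) - 1}: {stds[-1]:.3e}")
```

With small samples the fitted spread tends to drift toward zero over many generations, since every refit loses a little of the tails. Whether anything like this carries over to real training pipelines is exactly what the thread is arguing about.
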
Does AlphaZero get inbred?

>model collapse does not exist in practice

Dude what? That’s a pretty absurd claim. Most generally available models specifically curate their inputs for the express purpose of avoiding collapse induced by AI garbage. It’s literally among their cited reasons for avoiding AI-generated data as inputs.