Hacker News
by albert_e 1208 days ago
How much GPT-generated text is going right back into training new LLMs?

This has to eventually overwhelm organic human-generated content, doesn't it?

What's the way out of this?

3 comments

Why does there need to be a way out? Everyone just seems to assume that feeding model output into the training set is going to break things, but I don't get why.

AlphaZero learned to play chess and Go by training purely on its own data. Why is inserting the best outputs from GPT-4 into the training set for GPT-5 expected to make things worse? To me, it sounds like it could even be desirable.

In chess there is a very clear victory state, and a scoring function can be implicitly defined from a large number of games of various skill levels.

You really don't throw two sentences into the thunder dome to decide which one "wins". Means it's much more susceptible to being poisoned.

>You really don't throw two sentences into the thunder dome to decide which one "wins".

That's almost literally what RLHF is though, and that is the last step of training GPT-n. Then when GPT-{n+1} is being trained, it will include some results from GPT-n, and therefore will benefit from that finetuning, even before it goes through its own round of RLHF. Also, on average good outputs of GPT-n are more likely to be included in the training set of GPT-{n+1} (because it ends up as a buzzfeed article or a top post on reddit or something), so there is an additional signal beyond the above.
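For anyone who hasn't seen it written down, the pairwise comparison in RLHF-style reward modelling is usually expressed as a loss over a preferred and a rejected response. A minimal sketch, assuming scalar reward scores already exist for both responses (the function name is made up for illustration, and this is not anyone's production code):

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the response the human preferred already scores
    higher, and large when the rejected response scores higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Note that the "winner" here is whatever the human labeller picked, so the loss only encodes their preference, not any independent notion of correctness.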

I suspect the comment about the thunder dome was a reference to RLHF. On the one hand, RLHF seems far superior to the kind of prompt engineering Microsoft seems to have relied on with Sydney. On the other, it's dubious that the manual selection in RLHF is really always selecting for quality, rather than, at least to a significant extent, pandering to whatever biases or preferences the humans in the training loop might have.
That's not what RLHF is. In the thunderdome, as in chess, you don't need human judges or an oracle to know who's won. That makes a significant difference to the training procedure.
That’s correct. I have seen the above argument a lot: Using analogy as a basis for proof!
> Why is inserting the best outputs from GPT-4 into the training set for GPT-5 expected to make things worse?

Firstly what makes you think only the best output from 4 will go into future training sets? It’s just as likely to be the most bizarre or ludicrous, or dangerous that gets shared and discussed.

But also, how will v5 get to be better than v4 if it’s trained significantly on v4 output? It would just end up being trained to be the same, to have the same flaws and quirks reinforced.

We already know v4 just makes stuff up; it’s incredibly good at producing well-formatted, plausible-looking but utterly factually incorrect output. That’s because it has no concept of truth or facts. All it knows about from the token sequence weightings is the form of language, not the content. Feeding that back into future models is the last thing we should be doing.

>Firstly what makes you think only the best output from 4 will go into future training sets? It’s just as likely to be the most bizarre or ludicrous

That's true now, because LLMs are new so the failure cases are still interesting. If we are talking about a hypothetical world in which LLM outputs are a significant portion of the internet, then most of it would be from reddit comments/tweets/HN posts/buzzfeed articles/etc.

Then if you take only the ones which have more than average views/upvotes/etc. you should expect to get the 'best' results.
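As a rough sketch of what that selection would look like, assuming each scraped item is a dict with a hypothetical "score" field holding its upvote or view count:

```python
from statistics import mean


def above_average(posts: list[dict]) -> list[dict]:
    """Keep only the posts whose score is strictly above the mean score."""
    if not posts:
        return []
    threshold = mean(post["score"] for post in posts)
    return [post for post in posts if post["score"] > threshold]


# Hypothetical sample data, purely for illustration.
sample = [
    {"text": "a", "score": 1},
    {"text": "b", "score": 5},
    {"text": "c", "score": 12},
]
kept = above_average(sample)
```

Whether a high score actually tracks quality is a separate question; the filter only selects for whatever the votes measure.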

I'm still not convinced that's a reliable indicator of quality. It's potentially a measure of popularity or entertainment value, or maybe pandering to preconceptions but that's not at all the same thing.

Ask yourself: what are the from-scratch metrics for quality that you would like to select for? Then consider the likely or possible criteria people actually have for upvoting stuff on reddit. I think you'll find there is probably very little correlation between those. This is called the alignment problem and it's very hard to get right.

Correct output will be desirable. If you feed it nonsense, whether human- or AI-generated, you might break it.
Then we should encourage labeled ChatGPT content like ShareGPT, which can be easily avoided in future datasets because it is clearly labeled as AI-generated content.

It's the stuff that isn't labeled as generated with ChatGPT, et al, that will enter future training sets. I personally believe that's taking the "lossy JPEG" analogy too far, but I'm not an AI researcher.

As long as OpenAI keeps logs of every response ever returned, they can just filter that text out of any future training data.

Those logs aren't as large or unwieldy as they appear - the cost of storing a thousand words of text is tiny compared to the compute cost to generate it.
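A minimal sketch of that kind of filter, assuming the provider stores a hash of every response it has returned (the function names and the normalisation step are my own assumptions, not anything OpenAI has described):

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a response after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_logged(candidates: list[str], logged: set[str]) -> list[str]:
    """Return only the candidate texts whose fingerprint is not in the log."""
    return [text for text in candidates if fingerprint(text) not in logged]


# Hypothetical log containing a single previously generated response.
logged_fingerprints = {fingerprint("Hello   World")}
clean = drop_logged(["hello world", "something else"], logged_fingerprints)
```

Storing a 32-byte hash per response is cheaper still than storing the text, but an exact-match filter like this only catches verbatim copies. Anything edited, excerpted, or paraphrased would slip through.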

True, but presumably OpenAI won't be running the only publicly available LLMs forever.
Watermarking. From an outsider's perspective, the issue appears to be reaching consensus on how this can be implemented (though not in the technical sense). There's a game-theoretic challenge in that if models define and publish detection mechanisms, this creates a motivation for people to use other systems that don't include this.

On the technical front there's a good paper here: https://arxiv.org/pdf/2301.10226.pdf, and a nice very approachable video explaining it here: https://www.youtube.com/watch?v=XZJc1p6RE78.

The problem with watermarking like this, which is incredibly clever, is that it’s trivial to break. All you have to do is change one word in the text, and the watermarking of all subsequent tokens is spoiled. So if you change the first word, or rephrase the first sentence, or extract text from the middle or end of a response, the watermark is completely spoiled.
There can be redundancy in the watermark, meaning you'll have to change more than one word. See e.g. how error-correcting codes work.
There are definitely paths of attack. The trivial ones that you call out - insertion, deletion, substitution - are covered in section 7 of that paper (along with mitigations).
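To make the detection side concrete, here is a toy sketch in the spirit of the green-list scheme from the linked paper. It is heavily simplified and rests on my own assumptions: words stand in for tokens, a token's "green" status comes from hashing it together with the single previous token (rather than seeding an RNG and partitioning a real vocabulary), and GAMMA = 0.5 is an arbitrary choice.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of tokens that count as "green" at each step


def is_green(prev_token: str, token: str, gamma: float = GAMMA) -> bool:
    """Pseudo-randomly mark `token` as green, keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return value < gamma


def green_flags(tokens: list[str], gamma: float = GAMMA) -> list[bool]:
    """One flag per scored token (every token except the first)."""
    return [is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:])]


def z_score(tokens: list[str], gamma: float = GAMMA) -> float:
    """How far the green count deviates from what unwatermarked text would give."""
    flags = green_flags(tokens, gamma)
    scored = len(flags)
    if scored == 0:
        raise ValueError("need at least two tokens")
    return (sum(flags) - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

In this toy version each flag depends only on a token and the one before it, so substituting a single token can flip at most two flags and leaves the rest of the sequence scored exactly as before. Detection degrades with the amount of editing rather than collapsing after the first changed word.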