| HN Mirror



	by bonzaidrinkingb 935 days ago
	That is a pretty convoluted and expensive way to use ChatGPT as an internet search. I see the vulnerability, but I do not see the threat. I've seen it "exploited" way back when ChatGPT was first introduced, and a similar trick worked for GPT-2 where random timestamps would replicate or approximate real posts from anon image boards, all with a similar topic.

4 comments

NicuCalcea 935 days ago

I think it may change the discussion about copyright a bit. I've seen many arguments that while GPTs are trained on copyrighted material, they don't parrot it back verbatim and their output is highly transformative.

This shows pretty clearly that the models do retain and return large chunks of text exactly as they read them.


bonzaidrinkingb 935 days ago

I suspect ChatGPT is using a form of clean-room design to keep copyrighted material out of the training set of deployed models.

One model is trained on copyrighted works in a jurisdiction where this is allowed and outputs "transformative" summaries of book chapters. This serves as training data for the deployed model.


LeifCarrotson 935 days ago

The article describes how the deployed model can regurgitate chunks of copyrighted works - one of the samples literally ends in a copyright notice.


bonzaidrinkingb 935 days ago

If these were copyrighted works, how did these end up in the public comparison dataset?

Sure, some copyrighted works ended up in the Pile by accident. You can download these directly, without the elaborate "poem" trick.


a1o 935 days ago

That sounds like copyright washing, if there is such a thing.


jnwatson 935 days ago

If that's copyright washing, so are Cliff's Notes.


xp84 935 days ago

Yup, though a lot of people are acting now as though every already-established principle of fair use needs to be revised suddenly by adding a bunch of "...but if this is done by any form of AI, then it's copyright infringement."

A cover band who plays Beatles songs = great
An artist who paints you a picture in the style of so-and-so = great

An AI who is trained on Beatles songs and can write new ones = exploitative, stealing, etc.
An AI who paints you a picture in the style of so-and-so = get the pitchforks, Big Tech wants to kill art!


blitzar 935 days ago

> A cover band who plays Beatles songs

Has to pay the Beatles for the pleasure of doing so.


whstl 934 days ago

This discussion about art "in the style of" being stealing or exploitative didn't start with AI. For quite some time there have been complaints about advertisers commissioning sound-alike tunes to avoid paying licensing fees. AI is only automating it and making it possible on an industrial scale.


lewhoo 935 days ago

Well, I don't know about that. I strongly suspect ChatGPT could deliver whole copyrighted books piece by piece. I suspect that because it most certainly can do that with non-copyrighted text. Just ask it to give you something out of the Bible or Moby Dick. Cliff's Notes can't do that.


whatshisface 935 days ago

Why would you suspect that?


mariojv 935 days ago

To me, it seems like more of a competitive issue for OpenAI if part of their secret is the ability to synthesize good training data, or if they're purchasing training data from some proprietary source.


valine 935 days ago

I suspect OpenAI’s advantage is their ability to synthesize a good fine-tuning dataset. My question would be: is this leaking data from the fine-tuning dataset or from the initial training of the base model? The base model training data is likely nothing special.


bonzaidrinkingb 935 days ago

Good point. But many are already directly training on output from GPT. Probably more efficient than copying the raw training data. Especially if it relies on this non-targeted approach.


dvfjsdhgfv 935 days ago

> I do not see the threat.

It becomes one if for some reason you decide to train your model on sensitive data.


bonzaidrinkingb 935 days ago

In certain circumstances, I could see that.

Then again, if you have access to a model trained on sensitive data, why not ask the model directly, instead of probing it for training data? If sensitive data is never meant to be reasoned over and output, why did you train on it in the first place?


dvfjsdhgfv 935 days ago

The entity training the model and the users of the model are not necessarily the same. Asking the model directly will not (or: shouldn't) work if there are guardrails in place against giving out specific information. As for why, there are many reasons, one being that you train your model on such a huge number of items that you can't guarantee nothing is in there that shouldn't be.


bonzaidrinkingb 935 days ago

If there are guardrails in place not to output sensitive data (good practice anyway), then how would this technique suddenly bypass that?

I still have trouble seeing a direct threat or attack scenario here. If it is privacy sensitive data they are after, a regex on their comparison index should suffice and yield much more, much faster.
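
A minimal sketch of the kind of regex scan meant here, assuming the comparison index is available as plain-text documents. The patterns and the `scan_documents` helper are hypothetical and deliberately crude; real detection of privacy-sensitive data would need much more careful patterns.

```python
import re
from typing import Iterable, Iterator

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_documents(docs: Iterable[str]) -> Iterator[tuple[int, str, str]]:
    """Yield (document index, pattern name, matched text) for every hit."""
    for i, doc in enumerate(docs):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(doc):
                yield i, name, match.group(0)


if __name__ == "__main__":
    # Made-up sample documents standing in for entries of a text index.
    sample = [
        "Contact jane.doe@example.com or call 555-123-4567.",
        "Nothing sensitive in this one.",
    ]
    for hit in scan_documents(sample):
        print(hit)
```

A scan like this runs directly over the text, with no model in the loop, which is why it would be faster than prompting for regurgitated training data.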


munro 935 days ago

I think the exploit would be training on ChatGPT users' chat history.

> Chat history & training
> Save new chats on this browser to your history and allow them to be used to improve our models. Unsaved chats will be deleted from our systems within 30 days. This setting does not sync across browsers or devices. Learn more


bonzaidrinkingb 935 days ago

If ChatGPT ever outputs other users' chat history, the company is as good as dead. If that could be exploited using this technique, which has been out in the wild for over a year: show me the data.


whywhywhywhy 935 days ago

It already has: https://www.bbc.co.uk/news/technology-65047304


timfsu 935 days ago

That was a regular frontend bug though, not an issue with the LLM.


Jensson 935 days ago

It is an issue with the company though. I saw that as well. The point is that leaking user data doesn't destroy startups, and it barely even hurts well-established companies.


zer0c00ler 934 days ago

Read OpenAI's response to this security issue carefully - it tells you a lot about how they think of being responsible for issues like this. I remember they put all the blame on the open source library, rather than taking responsibility themselves.
