Hacker News
by mike_hearn 932 days ago
I'm not sure how this is an attack. Is it actually vital that models don't repeat their training data verbatim? Often that's exactly the answer the user will want. We are all used to a similar "model" of the internet that does that: search engines. And it's expected and required that they work this way.

OpenAI argue that they can use copyrighted content so repeating that isn't going to change anything. The only issue would be if they had used stolen/confidential data to train on, and it was discovered that way, but it also seems unlikely anyone could easily detect that given that there'd be nothing to intersect it with, unlike in this paper.

The blog post seems to slide around quite a bit, roving from "it's not surprising to us that small amounts of random text is memorized" straight to "it's unsafe and surprising and nobody knew". The "nobody knew" idea, as Jimmc414 has nicely proven in this thread, is a false alarm, because their technique actually had been detected and the paper authors just didn't know that it had been. And "it's unsafe" doesn't make any sense in this context. Repeating random bits of memorized text surrounded by huge amounts of original text isn't a safety problem. Nor is it an "exploit" that needs to be "patched". OpenAI could ignore this problem and nobody would care except AI alignment researchers.

The culture of alarmism in AI research is vaguely reminiscent of the early Victorians who argued that riding trains might be dangerous, because at such high speeds the air could be sucked out of the carriages.

1 comment

Speaking of remembering training data, I see that as a big problem with chat-based systems. They swallow a bunch of data, then generate something when prompted. My worry is not so much copyright infringement as something like "citation needed".

Has anyone done any work to produce citations for the generated data?

Some work, yeah. It's still an open problem to do it well, but I think the folks at Anthropic have made a reasonable start with their work[1] on influence functions ("tracing model outputs to the training data"). Basically their work attempts to answer the question "what particular training data most strongly influenced the model to give the answer it did", by doing some fancy math that I think is equivalent to taking the gradient produced by each piece of training data, computing the derivative of loss on the output of interest as the gradient is applied to the model, and then using that as the answer.

Though it sounds like even their clever, much cheaper approach is still very expensive.
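To make the description above a bit more concrete, here is a toy sketch of the first-order version of that idea. It assumes a tiny linear model with squared loss, and the names (`loss_grad`, `influence_scores`) are made up for illustration; this is not the paper's code. As I understand it, real influence functions also rescale by an approximate inverse Hessian, which this sketch leaves out entirely.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to the weights w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

def influence_scores(w, train, query):
    """Score each training example by how much one small gradient-descent
    step on it would reduce the query loss, to first order.
    That first-order change is proportional to the dot product of the
    query gradient and the training example's gradient."""
    g_query = loss_grad(w, *query)
    return [dot(g_query, loss_grad(w, x, y)) for x, y in train]

# Hypothetical toy data: (features, target) pairs.
w = [0.0, 0.0]
train = [
    ([1.0, 0.0], 1.0),   # agrees with the query
    ([0.0, 1.0], 1.0),   # touches an unrelated feature
    ([1.0, 0.0], -1.0),  # contradicts the query
]
query = ([1.0, 0.0], 1.0)

scores = influence_scores(w, train, query)
# Most helpful training examples first.
ranked = sorted(range(len(train)), key=lambda i: scores[i], reverse=True)
print(scores, ranked)
```

A positive score means a step on that training example would push the query loss down, a negative one means it would push it up. Even this naive version needs one gradient per training example, which gives some feel for why doing it at LLM scale gets expensive.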

[1] paper at https://arxiv.org/abs/2308.03296, post at https://www.anthropic.com/index/influence-functions