| HN Mirror



	by Matticus_Rex 904 days ago
	It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset. The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.

2 comments

eigenket 904 days ago

If it can and does reproduce a piece of text verbatim, then the text is indisputably stored somehow in the model.


Matticus_Rex 904 days ago

That's just not true. There's no search and retrieval involved. It just associates the words so strongly in that context because they were in the training data so often that next-token prediction can (sometimes, in some limited circumstances) reproduce chunks of it. It's like if a human had read pieces of an article so many times and knew NYT style so well that they could spit out chunks of an article verbatim, but using more efficient hardware and with no actual self-understanding of what it's doing.


vel0city 904 days ago

So it stores the words, and it stores the links between those words...

but somehow storing the words and their links is not storing the actual text? What is text but words and their links?

If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyrighted text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
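The words-plus-pointers thought experiment can be made concrete with a minimal sketch. The sample sentence and the names `vocabulary`, `pointers`, and `reconstructed` are made up for illustration; this only shows the data structure being described, not how a language model represents text.

```python
# Hypothetical illustration of "a database of words + a list of pointers".
# The sample sentence stands in for a copyrighted text.
text = "the quick brown fox jumps over the lazy dog"

# The "database of words": each distinct word stored once.
vocabulary = sorted(set(text.split()))
index_of = {word: i for i, word in enumerate(vocabulary)}

# The "list of pointers": positions into the vocabulary, in reading order.
pointers = [index_of[word] for word in text.split()]

# Following the pointers in order rebuilds the original text exactly.
reconstructed = " ".join(vocabulary[i] for i in pointers)
```

Neither `vocabulary` nor `pointers` contains the sentence as a string, yet together they determine it completely.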


Matticus_Rex 904 days ago

It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will under some circumstances, with very careful prompting, and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it to get this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.
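A toy sketch of the mechanism described here, under loud assumptions: a word-level model that keeps only counts of which word follows each two-word context, trained on a made-up corpus in which one passage is quoted several times. Real LLMs are neural networks over subword tokens, so this is only an analogy; the corpus, `train`, and `generate` are all hypothetical.

```python
from collections import Counter, defaultdict

# Made-up passage that is "quoted all over the dataset".
passage = "the oldest dna ever found reveals a lost world"
corpus = [
    passage,
    "critics quoted the line " + passage,
    "as reported " + passage,
    "the oldest tree in the park fell down",
    "a lost dog was found in the park",
]

def train(docs, order=2):
    """Count which word follows each `order`-word context."""
    counts = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for i in range(len(words) - order):
            context = tuple(words[i:i + order])
            counts[context][words[i + order]] += 1
    return counts

def generate(counts, prompt, max_words=20, order=2):
    """Greedy decoding: always append the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        context = tuple(words[-order:])
        if context not in counts:
            break
        words.append(counts[context].most_common(1)[0][0])
    return " ".join(words)

model = train(corpus)
output = generate(model, "the oldest")
```

With this corpus, `output` equals `passage`: the repeated quotation outweighs the competing continuations ("the oldest tree", "a lost dog"). Whether counts like these amount to "storing" the text is exactly what the thread is arguing about; the sketch does not settle it.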

Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.


vel0city 904 days ago

> it just stores information about their likelihood of being used together

I mean this is still a link, no?

Like, sure, it is a probability. But if each of those probabilities is like 99.9999% likely to get you to a chain of outputs that verbatim reproduces the copyrighted text given the right prompt, isn't that still the same thing?
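A quick back-of-the-envelope check of that intuition, assuming (hypothetically) a 100-token passage and independent per-token probabilities; the 0.9 comparison figure is also just an illustrative assumption.

```python
# Probability of reproducing an entire passage when every token must be right.
tokens = 100  # assumed passage length, for illustration only

strong_link = 0.999999  # the per-token figure from the comment above
weak_link = 0.9         # hypothetical looser association, for comparison

p_strong = strong_link ** tokens  # chance the whole chain comes out verbatim
p_weak = weak_link ** tokens

print(f"per-token {strong_link}: whole passage {p_strong:.6f}")
print(f"per-token {weak_link}: whole passage {p_weak:.2e}")
```

At 99.9999% per token the full 100-token chain still comes out verbatim almost every time, while at 90% per token it almost never does, so the size of the per-token probability is what separates "a link" from a loose association.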

And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone Matticus_Rex is a criminal who committed all these crimes, and started listing off hallucinated court cases and news articles proving such things, that would be quite damaging to your reputation, wouldn't it? The model hallucinating the NYT publishing an article about how the moon landing was fake or something would be damaging to its reputation, right?

And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is this really some kind of extremely well-crafted prompt that would rarely ever come up?


eigenket 904 days ago

If it can reproduce the text then it is stored somehow.

It is stored in a somewhat hard-to-understand way, encoded in the weights of a network, but it must be stored; otherwise it would not be possible to reproduce it.

You can ask "please provide me with the first paragraph of the carl zimmer article on the oldest DNA" and it produces it, verbatim. This is not possible unless the model contains, encoded within it, the NYT's copyrighted text.


briansm 904 days ago

Sort of like the idea of practice: repetition of something devotes more brain space to that thing, so its compression ratio can decrease and it becomes less abstracted / more exact.


DennisP 904 days ago

What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.
