by threethirtytwo 146 days ago
Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

Most of a best-fit curve runs along a path that doesn’t even touch an actual data point.
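To make the curve-fitting analogy concrete, here is a minimal sketch of an ordinary least-squares line fit. The five data points are made up for illustration, and the example only shows what the analogy claims about a simple fitted line; it says nothing about how a large language model behaves.

```python
# Ordinary least-squares fit of a straight line y = slope * x + intercept.
# The data points are invented for illustration only.

def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


points = [(0, 1.0), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
slope, intercept = fit_line(points)

# Residual = observed y minus the y predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in points]

# Count how many data points the fitted line passes through exactly.
touched = sum(1 for r in residuals if abs(r) < 1e-9)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
print("points lying exactly on the line:", touched)
```

With only two parameters, the line cannot pass through five scattered points. A model with far more parameters than a line is a different situation, which is what the replies below argue about.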

3 comments

They do memorize some books. You can test this trivially by asking ChatGPT to produce the first chapter of something in the public domain -- for example, A Tale of Two Cities. It may not be exact word for word, but it'll be very close.

These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:

https://arxiv.org/abs/2601.02671

In that case I would say it is the act of reproducing the books that is illegal. Training the AI on said books is not.

So the illegality rests at the point of output and not at the point of input.

I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.

> So the illegality rests at the point of output and not at the point of input.

It's not as simple as that, as this settlement shows [1].

Also, generating output is what these models are primarily trained for.

[1]: https://www.bbc.com/news/articles/c5y4jpg922qo

Unfortunately a settlement doesn't really show you anything definitive about the legality or illegality of something.

It only shows you that the defendant thought it would be better for them to pay up rather than continue to be dragged through court, and that the plaintiff preferred some amount of certain money now over some other amount of uncertain money later, or never.

We cannot say with any amount of confidence how the court would have ruled on the legality, had things been allowed to play out without a settlement.

> Also, generating output is what these models are primarily trained for.

Yes, but not for generating illegal output. These models were trained with the intent to generate legal output. The fact that they can generate illegal output is a side effect. That's my point.

If you use AI to generate illegal output, that act is illegal. If you use AI to generate legal output, that act is not illegal. Thus the point of output is where the legal question lies. From inception up to training, there is clear legal precedent for the existence of AI models.

If there is one exact sentence taken out of the book and not placed in quotes with its exact source, that triggers copyright law. So the model doesn't have to reproduce the entire book; it only has to reproduce one specific sentence (which may be a sentence characteristic of that author or that book).

> If there is one exact sentence taken out of the book and not placed in quotes with its exact source, that triggers copyright law.

Yes, and that's stupid, and will need to be changed.

Sure, but that use would easily pass a fair use test, at least in the US.

Models absolutely do reproduce books.

> With a simple two-phase procedure, we show that it is possible to extract large amounts of in-copyright text from four production LLMs. While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.

https://arxiv.org/abs/2601.02671

The supplementary files in that paper—verbatim reproductions of the full texts of Frankenstein and The Great Gatsby—are pretty instructive. The research group highlighted all additions and omissions, but on most pages the differences are difficult to spot because they are only missing spaces, extra hyphens, and other typographical minutiae.
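For anyone who wants to check "near-verbatim" claims like this themselves, here is a rough sketch using Python's standard `difflib`. The `original` string is the opening sentence of Frankenstein; the `extracted` string is a hypothetical output I altered by hand (one extra hyphen, one missing space) to mimic the kind of differences described above. It is not taken from the paper, and this is not the paper's measurement procedure.

```python
import difflib

# Opening sentence of Frankenstein (public domain).
original = (
    "You will rejoice to hear that no disaster has accompanied the "
    "commencement of an enterprise which you have regarded with such "
    "evil forebodings."
)

# HYPOTHETICAL model output, hand-edited for illustration:
# an extra hyphen in "enterprise" and a missing space before "forebodings".
extracted = (
    "You will rejoice to hear that no disaster has accompanied the "
    "commencement of an enter-prise which you have regarded with such "
    "evilforebodings."
)


def char_similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def list_differences(a, b):
    """Return (operation, text_in_a, text_in_b) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


ratio = char_similarity(original, extracted)
diffs = list_differences(original, extracted)

print(f"similarity: {ratio:.4f}")
for tag, in_original, in_extracted in diffs:
    print(tag, repr(in_original), "->", repr(in_extracted))
```

A similarity close to 1.0 with only single-character differences is the pattern the supplementary files show: the text is effectively the same, and the edits are typographical.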