by threethirtytwo 149 days ago
It does make sense. It’s controversial. Your memory memorizes things in the same way. So what nvidia does here is no different, the AI doesn’t actually copy any of the books. To call training illegal is similar to calling reading a book and remembering it illegal.

Our copyright laws are nowhere near detailed enough to settle this, so there is indeed a logical and technical inconsistency here.

I can definitely see these laws evolving to be human-centric: permissible for a human to do something, but not for an AI.

What is consistent is that obtaining the books was probably illegal. But if, say, nvidia had bought one kindle copy of each book from Amazon and scraped everything for training, that would fall into the grey zone.


> To call training illegal is similar to calling reading a book and remembering it illegal.

Perhaps, but reproducing the book from this memory could very well be illegal.

And these models are all about production.

To be fair, that seems to be where some of the AI lawsuits are going. The argument goes that the models themselves aren't derivative works, but the output they produce can absolutely be, in much the same way that reproducing a book from memory could be copyright violation, trademark infringement, or generally run afoul of the various IP laws.
Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

Most of the best fit curve runs along a path that doesn’t even touch an actual data point.
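
To make the best-fit-curve analogy concrete, here is a toy sketch: an ordinary least-squares line through five made-up points (the numbers are invented for illustration, not taken from any model or dataset). It only shows what the analogy claims for a low-capacity fit and says nothing about how a large model with billions of parameters behaves.

```python
# Toy illustration with made-up points; not taken from any real training set.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)

# Vertical distance from each data point to the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# True only if the line passes (numerically) through at least one point.
touches_any_point = any(abs(r) < 1e-9 for r in residuals)
print(slope, intercept, touches_any_point)
```

With these particular points the fitted line misses every one of them, which is the behavior the analogy relies on.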

They do memorize some books. You can test this trivially by asking ChatGPT to produce the first chapter of something in the public domain, for example A Tale of Two Cities. It may not be word for word exact, but it'll be very close.
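
One rough way to put a number on "very close" is a character-level similarity ratio from Python's standard `difflib`. This is only a sketch: `model_output` below is a hand-written stand-in with a changed punctuation mark and an extra space, not real output from ChatGPT or any other model, and the helper names are made up for the example.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so spacing differences are ignored."""
    return " ".join(text.lower().split())

def similarity(reference, candidate):
    """Return a similarity ratio between 0.0 and 1.0 for two texts."""
    matcher = SequenceMatcher(None, normalize(reference), normalize(candidate))
    return matcher.ratio()

# Opening of A Tale of Two Cities (public domain).
reference = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness"
)

# Hypothetical stand-in for model output: one comma became a semicolon
# and one extra space was added.
model_output = (
    "It was the best of times, it was the worst of times; "
    "it was the age of wisdom,  it was the age of foolishness"
)

score = similarity(reference, model_output)
print(f"similarity: {score:.3f}")
```

A score near 1.0 means near-verbatim; where to draw the line between "close" and "copied" is a judgment call the code does not make for you.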

These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:

https://arxiv.org/abs/2601.02671

In that case I would say it is the act of reproducing the books that is illegal. Training the AI on said books is not.

So the illegality rests at the point of output and not at the point of input.

I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.

> So the illegality rests at the point of output and not at the point of input.

It's not as simple as that, as this settlement shows [1].

Also, generating output is what these models are primarily trained for.

[1]: https://www.bbc.com/news/articles/c5y4jpg922qo

Unfortunately a settlement doesn't really show you anything definitive about the legality or illegality of something.

It only shows you that the defendant thought it would be better for them to pay up rather than continue to be dragged through court, and that the plaintiff preferred some amount of certain money now over some other amount of uncertain money later, or never.

We cannot say with any amount of confidence how the court would have ruled on the legality, had things been allowed to play out without a settlement.

>Also, generating output is what these models are primarily trained for.

Yes but not generating illegal output. These models were trained with intent to generate legal output. The fact that it can generate illegal output is a side effect. That's my point.

If you use AI to generate illegal output, that act is illegal. If you use AI to generate legal output, that act is not illegal. Thus the point of output is where the legal question lies. From inception up to training there is clear legal precedent for the existence of AI models.

If there is one exact sentence taken out of the book and not referenced in quotes with the exact source, that triggers copyright laws. So the model doesn't have to reproduce the entire book; it is only required to reproduce one specific sentence (which may be a sentence characteristic of that author or that book).
> If there is one exact sentence taken out of the book and not referenced in quotes and exact source, that triggers copyright laws.

Yes, and that's stupid, and will need to be changed.

Sure, but that use would easily pass a fair use test, at least in the US.
Models absolutely do reproduce books.

> With a simple two-phase procedure, we show that it is possible to extract large amounts of in-copyright text from four production LLMs. While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.

https://arxiv.org/abs/2601.02671

The supplementary files in that paper—verbatim reproductions of the full texts of Frankenstein and The Great Gatsby—are pretty instructive. The research group highlighted all additions and omissions, but on most pages the differences are difficult to spot because they are only missing spaces, extra hyphens, and other typographical minutiae.
> To call training illegal is similar to calling reading a book and remembering it illegal.

A type of wishful thinking fallacy.

In law scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.

It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent that it changes what your intention may be.
> It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent that it changes what your intention may be.

It sounds then like you're saying that scale does indeed matter in this context: every single piece of writing in existence isn't being slurped up purely to learn, it's being slurped up to make a profit.

Do you think they'd be able to offer a useful LLM if the model was trained only on what an average person could read in a lifetime?

It's common knowledge among LLM experts that the current capabilities of LLMs are triggered as emergent properties of training transformers on reams and reams of data.

That is the intent of scale: to push LLMs to this point of "emergence". Whether or not it's AGI is a debate I'm not willing to entertain, but everyone pretty much agrees that there's a point where the scale flips a transformer from being an autocomplete machine to something more than that.

That is the legal basis for why companies would go for scale with LLMs. It's the same reason why people are allowed to own knives even though knives are known to be useful for murder (as a side effect).

So technically speaking these companies have legal runway in terms of intent. Making an emergent and helpful AI assistant is not illegal, but also making a profit isn't illegal either.

Right, but in the weed analogy, the scale is used as a proxy to assume intent. When someone is caught with those 400 joints, the prosecution doesn't have to prove intent, because the law has that baked in already.

You could say the same in LLM training, that doing so at scale implies the intent to commit copyright infringement, whereas reading a single book does not. (I don't believe our current law would see it this way, but it wouldn't be inconsistent if it did, or if new law would be written to make it so.)

It’s clear nvidia and every single one of these big AI corps do not want their AIs to violate the law. The intent is clear as day here.

Scale is only used for emergence. OpenAI found that training transformers on the entire internet would make them more than just a next-token predictor, and that is the intent everyone is going for when building these things.

I don't think that's clear at all. Businesses routinely break the law if they believe the benefits in doing so will outweigh the consequences.

I think this is even more common and more brazen when it comes to "disruptive" businesses and technologies.

>Businesses routinely break the law if they believe the benefits in doing so will outweigh the consequences.

I'm saying there's collective incentive among businesses to restrict the LLM from producing illegal output. That is aligned and ultra clear. THAT was my point.

But if LLMs produce illegal output as a side effect and it can't be controlled, then your point comes into play, because now they have to weigh the costs and benefits as they don't have a choice in the matter. But that wasn't what I was getting at. That's your new point, which you introduced here.

In short it is clear all corporations do not want LLMs to produce illegal content and are actively trying to restrict it.

Er, no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based on scale. The law doesn’t differentiate between remembering one book and remembering a hundred, so there’s no difference for thousands or millions either.

No wishful thinking here.

> Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based off scale.

I'm not sure you understood what you said, but superficially it appears that you are agreeing with me?

Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.

We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/

No I'm not agreeing with you.

>Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.

The law says you're perfectly in your legal right to slurp up every piece of content ever produced.

>We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/

I'm aware, and the law doesn't talk about scale.

What is "scale" in this context? I think arguably 100 books over the span of decades is not "scale".

But tens (hundreds?) of thousands of books over the span of a few weeks? That's definitely "scale".

The law doesn't talk about scale, so either is perfectly legal. Memorizing a billion books vs memorizing one book. Same laws apply.
You can only read the book if you purchased it. Even if you don't have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?
Yep, I agree. That’s the part that’s clearly illegal. They should purchase the books, but they didn’t.
This is the bit an author friend of mine really hates. They didn’t even buy a copy.

And now AI has killed his day job writing legal summaries. So they took his words without a license and used them to put him out of a job.

Really rubs in that “shit on the little guy” vibe.

Obviously not; one can borrow books from libraries and read them as well.
That's true. But the book itself was legally purchased. So if nvidia went to the library and trained AI by borrowing books, that should be technically legal.
Do you have the same legal rights to something that you've borrowed as you do with something you've purchased, though?

Would it be legal for me to borrow a book from the library, then scan and OCR every page and create an EPUB file of the result? Even if I didn't distribute it, that sounds questionable to me. Whereas if I had purchased the book and done the same, I believe that might be ok (format shifting for personal use).

Back when VHS and video rental was a thing, my parents would routinely copy rented VHS tapes if we liked the movie (camcorder connected to VCR with composite video and audio cables, worked great if there wasn't Macrovision copy protection on the source). I don't think they were under any illusions that what they were doing was ok.

Well, if I copied it word for word, maybe. But if I read it and "trained it" into my brain, then it's clearly not illegal.

So the grey area here is: if I "trained" an LLM in a similar way and did not copy it word for word, is that legal? Because fundamentally speaking it's literally the same action taken.

But to train the models they have to download it first (make a copy)
You had to do this for reading too. The words were burned onto your retina as volatile memory before getting processed by your brain.

Your retina likely overwrote its "memory" as soon as you looked at something else, but that's no different than copying and deleting or, the more apt analogy, streaming.

The law makes a distinction between storing it on a disk and just remembering the content. The latter is not a "copy" and so is not covered by the law:

> “Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.

> A work is “fixed” in a tangible medium of expression when its embodiment in a copy or phonorecord, by or under the authority of the author, is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration. A work consisting of sounds, images, or both, that are being transmitted, is “fixed” for purposes of this title if a fixation of the work is being made simultaneously with its transmission.

https://www.copyright.gov/title17/92chap1.html

Interesting. How long is the transitory duration? The interpretation of that likely has yet to be determined by a court case and can evolve similar to how “all men are created equal” doesn’t just refer to men.

Seems to me a possible interpretation is just deleting the data after training is finished.

BS. Nvidia stores and reuses the copy for each training run, or do you really think they just download it each time in real time for training?
But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.
You need to pay for the books before you memorize them
Partially true. I can pay for a book then lend it out to people for free.

The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.

Okay, so go check out 500 TB worth of books from the library. I'll wait
If I’m rich enough to employ thousands of people I can hire each one of them to borrow as many books as possible then use all the books to train an AI. Perfectly legal. And also very possible.

Point being that the library prevents you from checking out 500 TB because of logistical issues. First, how could you carry all those books, and how could other patrons check out books if you grabbed that many? These rules aren’t there to prevent “scale”, which is why my method gets around them.

Great! Then it's perfectly legal.

As long as you obtain the books legally then it's legal

This really isn't that hard

So you’re wrong when you said you have to pay for the books. You don’t.
You can sit down at a library or Barnes and Noble and memorize for free.
You should read the sibling comments before leaving your unique "contribution"