by nilsbunger 402 days ago
There’s something called a substantial transformation test in copyright law. When you write a summary of a book, you don’t infringe on copyright because it’s a “substantial transformation”. This goes along with the idea that you can copyright the text but not the ideas it expresses.

When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.

4 comments

No transformation is needed.

The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"

Individuals who torrented music and video files have been bankrupted for doing exactly this.

The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.

If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are set damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.

> have been bankrupted for doing exactly this.

Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article I believe it was called out that Meta was being a leecher (not seeding back what they downloaded).

It's the hosting that gets you, not the act of downloading it.

> It's the hosting that gets you, not the act of downloading it.

However, people have been prosecuted for not even hosting a torrent, but merely providing a link to where people can find it.

e.g. https://torrentfreak.com/operator-of-popcorn-time-info-site-...

I would like to expand on this, since it seems to be a common misunderstanding. Let's imagine a hypothetical situation where one friend loans a book to another, who then makes a copy of it.

The lender owns the book, and it is within his rights to loan it to whoever he wants. That is legal. Making this illegal would end libraries.

The borrower is well within his rights to accept the book, and as the current owner he is even allowed to make a copy of the book (see the famous TIVO case). Making this illegal would end backups and format/time shifting.

When the borrower returns the book, he keeps the copy. Oh no! Surely he must now become a criminal? Nope. Possessing an unauthorized copy is also not illegal, despite what many copyright holders would like you to believe. Making this illegal would also criminalize a lot of legitimate format/time shifting, again see the famous TIVO case.

If the borrower were to loan his homemade copy to someone else THEN it would finally become illegal.

Nothing about AI changes any of this.

Don't read too much into what I am saying. I am not even talking about the AI piece.

Say I download a torrent of a movie that I didn't pay for. If I don't allow it to seed, then I don't get in trouble. If I let it seed, either during the download or after, I'd get a DMCA notice if that torrent/magnet link was being tracked.

I don't need a hypothetical book, that is just how it works if I were to download illegally obtained documents/media.

As technical as people are in this thread, easy to tell when folks didn't have their parents wondering why they were getting scary letters from the ISP.

If you made a durable copy of a book in your example to keep for yourself and use later, that's already a grey area. But no one does it with books. People do it with other media tho, and, big surprise, get prosecuted for it. As you may know, in developed countries people get served notices for torrenting.

But if you make a book's contents available online via some service that regurgitates them, you would be totally sus because you can be considered to be in the business of selling derivative works.

Do you have any case law (other than Tivo or VHS time-shifting) that relates directly to books?
There was a relatively famous Google case in 2015 regarding their digitization of books without the authors' consent, although it's not perfectly analogous to this situation.

In Google's case they were digitizing books (that they did not own) and publishing snippets for search users to help them find books and other material that weren't indexed on the web. The court found they had that right, but did place some pretty strict limits on them.

Still, Google was allowed to keep their database of scanned material despite not owning the originals.

Link: https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

According to your link and this comment https://news.ycombinator.com/item?id=43899406 Google's scanning project was ruled fair use because it was "transformative" and didn't harm the market for the works. It allowed searching within books that was otherwise impossible at the time, but by not providing the full text of the books it didn't meaningfully reduce sales.

Someone photocopying a book to read on the toilet (and leave the original on their nightstand) isn't engaging in transformative use. They're also harming the market for the work because if they hadn't made this photocopy, they would've had to buy a second copy of the book to get the same benefit.

This is a leap in the argument. We've gone from the right to use a work to "unless the result is identical or close to it, we have full rights to all works."

Seems like a big gap there.

It's COPYright. It has to be very close to the original to be covered by copyright. Hence the name.
They copied the work when they made the training set.
My understanding is copyright is about distribution rights and not making a copy. Seeding falls under distribution.
Your understanding is incorrect.
Even if you argue the LLMs are merely summarizing content, they still had to illegally download that content in the first place. The model can't read and summarize the texts unless the text was illegally downloaded and copied. Piracy isn't suddenly legal just because you promise to delete the movie you downloaded after watching it.
The counterargument to that is model training is impossible without making copies. That's not true for humans.
That's not really true. Models train (in a greatly simplified way) by being shown an excerpt and being told to guess the next token from the excerpt. They push around their weights until the token they output matches the next token in the excerpt. Then the excerpt is no longer needed. You can think of it like this: the article is loaded, the LLM plays this token-guessing game through it, then the article is discarded. On the face of it this is what happens, but it gets hairier depending on how exactly this process is done. But it is seemingly not far removed from how humans consume content (acquire, read, discard), hence the legal blur.
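
For anyone who wants to see the "token guessing game" concretely, here is a minimal sketch in plain Python. It is a toy bigram model, not how any real LLM is implemented: the function names (`train_next_token`, `predict_next`), the whitespace tokenizer, the step count and the learning rate are all assumptions made up for illustration. It only shows the loop described above: look at a token, guess the next one, nudge the weights toward the right answer, and keep only the vocabulary and the weights afterwards.

```python
import math


def train_next_token(excerpt, steps=200, lr=0.5):
    """Toy next-token training on one excerpt (hypothetical helper)."""
    tokens = excerpt.split()  # assumption: whitespace "tokenizer"
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)

    # weights[i][j] is the score for token j following token i.
    weights = [[0.0] * n for _ in range(n)]

    # Each training example is (current token, actual next token).
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]

    for _ in range(steps):
        for cur, nxt in pairs:
            row = weights[cur]
            # Softmax over the scores gives the model's guess.
            peak = max(row)
            exps = [math.exp(w - peak) for w in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Cross-entropy gradient step: raise the score of the
            # true next token, lower the others.
            for j in range(n):
                target = 1.0 if j == nxt else 0.0
                row[j] -= lr * (probs[j] - target)

    # Only the vocabulary and the weights are returned; the token
    # list and training pairs go out of scope here.
    return vocab, weights


def predict_next(vocab, weights, token):
    """Return the highest-scoring next token for `token`."""
    row = weights[vocab.index(token)]
    best = max(range(len(vocab)), key=row.__getitem__)
    return vocab[best]


vocab, weights = train_next_token("the quick brown fox jumps")
print(predict_next(vocab, weights, "quick"))  # brown
```

Note that the excerpt still has to sit in memory while the loop runs, and that a model this small trained on a text this short simply memorizes it. Both points are part of what is being argued about in this thread.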
> by being shown an excerpt [of copyrighted material]

How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.

> it is seemingly not far removed from how humans consume content

Except that humans don't make full copies to RAM, or disk or paper.

There is a bar of usage built into the law, otherwise everyone who reads this wired article is violating copyright by making a full copy to their computer. Generally making non-lasting copies is fine, otherwise the internet wouldn't work.

AI doesn't need lasting copies to train, though I don't know what the actual implementation is. But if it's ruled that they can only use copyrighted data if it's not stored for longer than it would take a human to consume it, that wouldn't really cripple the models, but would perhaps make training more logistically challenging.

It's important to understand that models are not data archives. They are statistical constructs built by being quizzed, with human-made content used to generate the quiz questions.

> otherwise everyone who reads this wired article is violating copyright by making a full copy to their computer

Wired explicitly sent that article to their computer for the purposes of reading it so it's not a copyright violation.

> Except that humans don't make full copies to RAM, or disk or paper.

Images on your retina form exact copies.

They are scanned and translated into impulses that are then sent to a first set of "neural columns" - that's an exact copy.

This is then connected to the visual cortex by the two highest-bandwidth links in the human body ("the optical nerve", there's 2 of them of course, always wondered why everybody insists on using the singular). Why would you have that high-bandwidth link unless to create verbatim copies?

The way those columns are structured also very strongly suggests they make carbon copies, which they then make available on the "brain bridge" (which is probably at least vaguely similar to the "attention matrix" of a transformer). If it does work like that, that's also a verbatim copy.

The only way "humans don't make full copies to RAM" is that humans don't have separate RAM. The memory is colocated with the processing, even on a microscopic level. You know, what everybody knows is the best way of doing things even in silicon, it's just incredibly impractical if you can't rebuild your circuit every time there's a slight change to the instructions your "computer" carries out (the brain is not a "Von Neumann architecture", except it kind of is when it regrows connections. But in the short term it isn't).

> that's an exact copy.

Not for the purposes of copyright law.

> is that humans don't have separate RAM [or disk]

And that turns out to be incredibly important. Humans can't create a lasting, shareable copy of a copyrighted work by consuming it.

Sure they can. You can learn a copyrighted work by heart, even indirectly, then quickly duplicate it by hand. Mozart was originally famous for making a business out of that.
It is a different thing. When you copy data into a computer's RAM, that might be copying as defined in law [1]:

> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.

For comparison, when a human looks at the letters, there is no copying.

Also, models can reproduce text verbatim which proves that they store it.

So it is unfair when ordinary folks get sued for this and Zuckerberg wants to get away with a violation a million times larger. He must go directly to jail.

[1] https://www.iusmentis.com/copyright/software/rights/

It's also true for humans, you memorize only parts of what you read and see but you still had to view the whole thing first.

The computer model is working differently of course but functionally it's the same idea.

God I hate this conversation so much. These cases have nothing to do with how the brain works.
That's why the keyword here is functionally.