by Workaccount2 498 days ago
There are two different things when it comes to discussing training LLMs on "copyright"-protected data, and I almost never see people differentiate between them.

1.) Training on copyrighted material that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.

2.) Training on copyrighted material that is not publicly available. These are pretty much pirated works, or works obtained through a backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However, there might be conditions here too, like paying for access to an archive and then training on everything in it.


I never gave my poem to Facebook. My site is for humans. And there was absolutely no problem with that website being public, until Facebook et al. wanted to move the goalposts... again. Remember when companies started to claim that their abuse is on you, because you failed to publish the correct headers/robots.txt and their bot needs to be told the rules in specific language? And now we get the same attempt at making such a distinction again, just this time it's our fault for... having a public website in the first place (should have operated a paywall, duh!)
3.) The company making an unauthorized copy of your work and storing it permanently in a giant corporate library of their own making which they refer to over and over.

This is distinct from (1) where the content is streamed or only ephemeral/incidental copies are made.

The very idea that LLMs are "inspired" by copyrighted material is so far beyond absurd I just don't know what reality you people live in. They are ingesting copyrighted material in order to re-use it. Yeah, they remix it to add their own (incredibly annoying) tone, but that's what they're doing.
good distinction

IMO there's a hack for this:

authors can state that they allow public use of their work unless it's used for training LLMs. Then all training on such works would fall under (2), because the works would be used against the terms of the copyright.

I think they would need to have some explicit contract every time they want to sell the book then, though. I don’t think I am bound by some random terms someone writes into a book I’m buying. Those probably are only binding if a reasonable person would notice them before sale.
If you arrive at the point of being able to buy that book, it means it has passed through the publisher's hands, and I would think that the publisher was OK with those terms, so limiting the usage of the text may in fact be effective. If it was self-published, then even more so.
But the license restriction would have to apply both to the publisher and the customer.

If I go to the bookstore, buy the book, make a scan, and train an LLM with it, how would you enforce your license as an author? The customer never knew that he shouldn’t have been allowed to train LLMs.

Edit: I think I misunderstood the original comment, I thought the idea was to sell books and restrict use for LLM training. If we’re only talking about stuff that’s publicly released, the restriction should be possible.

Whether you make a scan of it or not, the license applies to the IP, I guess (IANAL).

Whether the shop makes a scan should not affect you as the buyer of the actual book. What does the scan have to do with you?

Whether or not the author learns about that scan, and perhaps about the training of some LLM using it, does not change the legality of it.

But the license doesn’t apply to me as a customer if I can’t be expected to even notice it. If I buy a book in a bookstore, no one would assume that training LLMs on it would be explicitly forbidden. And adding a note to the book would probably not be binding because no one is expected to read the legal notice in a book.
I'm not sure there's any legal distinction though.

Is a book publicly available? No, you have to purchase it. But once you do, you're legally allowed to let your friends and family and so forth read it too, as long as you don't sell copies of it (the "copy" part of "copyright") or meaningfully take away the publisher's ability to make money from sales (so you can't post it on the internet for the whole world to see).

And sure, there are lots of ToS for digital works, but are they actually enforceable? ToS can say you're not allowed to let anyone else read the book you purchased. But no court is going to say you can't lend your Kindle to your friend for them to read it too. Many ToS clauses are flat-out illegal.

Meta will argue that training on books is no different from reading all the books at a friend's house. That as long as Meta isn't reselling or making publicly available the original text, they're in the clear.

I don't know what the deal is in the relevant jurisdictions, but in Swedish copyright law, the provenance of the original matters ("lovlig förlaga").

This means that it's not legal to download a rip of e.g. a CD that was uploaded without consent, even if you own a copy.

(This exception to the general right to make copies for private use was added in 2005 to make downloading illegal -- previously, only uploading was infringing.)

I would assume just the act of downloading this content was illegal in the relevant US jurisdictions as well.

I believe the most famous cases in the US have only gone after the people sharing or seeding or uploading content. My ISP couldn't care less what I download from Usenet, but they will definitely care when I start seeding.
But they are making unauthorized copies: their training data set is analogous to a private collection of duplicates.

What do you think copyright law(suits) would do if a regular person made copies of every book and movie and song they saw, placing the duplicate media in a room of their house?