| HN Mirror


by JW_00000 537 days ago

I don't understand why it's even a question whether Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:

> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.

Following that reference:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)

Furthermore, they state they trained on GitHub, web pages, and arXiv, which all contain copyrighted content.

Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.

[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971

[Gao et al., 2020] https://arxiv.org/pdf/2101.00027

3 comments

gameshot911 537 days ago

Critically, by torrenting they also directly distributed the copyrighted material itself. That is a standalone infringement separate from any argument about trained LLMs.

jimjimwii 537 days ago

They could have only leeched and refrained from sharing any part of the copyrighted data. If I were to commit something as risky as this, that is what I would do.

zelphirkalt 537 days ago

Then it would need to be determined whether that is the case or not. Did every single machine they used have the configuration for only leeching and no seeding? The company is liable for what its employees do on the job. If only one employee was also seeding ... that could be a very interesting case.

crazygringo 537 days ago

> Did every single machine they used have the configuration for only leeching and no seeding?

I would certainly assume so. It's incredibly obvious that's what you would want to do from a legal standpoint.

> If only one employee was also seeding ... that could be a very interesting case.

The torrenting wouldn't be done casually by employees acting on their own. And it's not like multiple employees are doing it simultaneously, unsupervised, on their personal computers.

This is part of an official project. They'd spin up a machine just to download the torrent, being careful to disable seeding.

This is Meta. They have lawyers involved and advising. This isn't a teenager who doesn't fully understand how torrenting works.

mvdtnz 537 days ago

Did you not read the article? There are quotes from Meta employees doing exactly what you claim they wouldn't do.

> This is part of an official project. They'd spin up a machine just to download the torrent, being careful to disable seeding.

From the article:

> "Torrenting from a corporate laptop doesn’t feel right," Nikolay Bashlykov, a Meta research engineer, wrote in an April 2023 message, adding a smiley emoji. In the same message, he expressed "concern about using Meta IP addresses 'to load through torrents pirate content.'"

You also claim they would be "careful to disable seeding" but we know they did in fact seed (and anyone who uses private trackers knows they couldn't get away with leeching for very long before being kicked off):

> Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition.

alphan0n 536 days ago

Seeding can be trivially faked to trackers.

https://github.com/slundi/RatioUp

https://github.com/anthonyraymond/joal

http://ratiomaster.net/

The smallest amount of seeding possible would be metadata, presumably not subject to copyright.

qup 537 days ago

And punishing them in the normal manner will be an incredibly small slap on the wrist, and do absolutely nothing to help us find out what will play out in court regarding a fair-use defense on training AI with copyrighted material.

lucianbr 537 days ago

Isn't there a "fruit of the poisonous tree" kind of thing? Sounds to me quite similar to the situation where you would murder your parent and get to keep the inheritance, even if you are convicted of murder. Inheriting stuff isn't illegal, yet I think most jurisdictions would not allow you to keep it in this case.

There should be a problem with stuff obtained through illegal means, even if having that stuff is in principle legal. In this case, copyrighted material.

Obviously they would argue that having the data is only a consequence of the download part, and that part is legal. What I see is that these situations are always complicated, and if you're rich enough, you get to litigate the complications and come out with a slap on the wrist or maybe even clean hands, while if you are an ordinary citizen, you can't afford to delve into the complexities and get punished.

These days I'm starting to give up on the whole concept of the legal system being fair. They're not even pretending anymore.

Workaccount2 537 days ago

There are two different things when it comes to discussing training LLMs on "copyright" protected data, and I almost never see people differentiate.

1.) Training on copyrighted work that is publicly available. You write a poem and publish it online for the world to read. That is your IP; no one else can take it and sell it, but they are free to read and be inspired by it. The legality of training on this is in the courts, but so far seems to be going in favor of LLMs.

2.) Training on copyrighted work that is not publicly available. These are pretty much pirated works or works obtained by backdoor to avoid paying for them. Your poem is behind a paywall and you never got paid, yet the poem is known by the LLM. This is just straight illegal, as you legally must pay to view the work. However, there might be conditions here too, like paying for access to an archive and then training on everything in it.

edelbitter 537 days ago

I never gave my poem to Facebook. My site is for humans. And there was absolutely no problem with that website being public, until Facebook et al. wanted to move the goalposts... again. Remember when companies started to claim that their abuse is on you, because you failed to publish the correct headers/robots.txt and their bot needs to be told the rules in specific language? And now we get the same attempt at making such a distinction again, just this time it's our fault for... having a public website in the first place (should have operated a paywall, duh!)

Terr_ 533 days ago

3.) The company making an unauthorized copy of your work and storing it permanently in a giant corporate library of their own making which they refer to over and over.

This is distinct from (1) where the content is streamed or only ephemeral/incidental copies are made.

mvdtnz 537 days ago

The very idea that LLMs are "inspired" by copyrighted material is so far beyond absurd I just don't know what reality you people live in. They are ingesting copyrighted material in order to re-use it. Yeah, they remix it to add their own (incredibly annoying) tone, but that's what they're doing.

farukozderim 537 days ago

good distinction

IMO there's a hack for this:

Authors can state that they allow public use unless the work is used for training LLMs. Then all training use would fall under (2), because the works would be used in violation of the copyright.

echoangle 537 days ago

I think they would need to have some explicit contract every time they want to sell the book then, though. I don’t think I am bound by some random terms someone writes into a book I’m buying. Those probably are only binding if a reasonable person would notice them before sale.

zelphirkalt 537 days ago

If you arrive at the point of being able to buy that book, it means it has passed through the publisher's hands, and I would think that the publisher was OK with those terms then, and limiting the usage of the text may in fact be effective. If it was self-published, then even more so.

echoangle 537 days ago

But the license restriction would have to apply both to the publisher and the customer.

If I go to the bookstore, buy the book, make a scan, and train an LLM with it, how would you enforce your license as an author? The customer never knew that he shouldn’t have been allowed to train LLMs.

Edit: I think I misunderstood the original comment, I thought the idea was to sell books and restrict use for LLM training. If we’re only talking about stuff that’s publicly released, the restriction should be possible.

zelphirkalt 537 days ago

Whether you make a scan of it or not, the license applies to the IP, I guess (IANAL).

Whether the shop makes a scan should not affect you as the buyer of the actual book. What does the scan have to do with you?

Whether the author learns about that scan and perhaps training of some LLM using the scan or not, does not change the legality of it.

crazygringo 537 days ago

I'm not sure there's any legal distinction though.

Is a book publicly available? No, you have to purchase it. But once you do, you're legally allowed to let your friends and family and so forth read it too. As long as you don't sell copies of it (the "copy" part of "copyright"), or meaningfully take away the ability for the publisher to make money from sales (so you can't post it for the whole world to see on the internet).

And sure, there are lots of ToS for digital works, but are they actually enforceable? ToS can say you're not allowed to let anyone else read the book you purchased. But no court is going to say you can't lend your Kindle to your friend for them to read it too. Many ToS clauses are flat-out illegal.

Meta will argue that training on books is no different from reading all the books at a friend's house. That as long as Meta isn't reselling or making publicly available the original text, they're in the clear.

Snild 537 days ago

I don't know what the deal is in the relevant jurisdictions, but in Swedish copyright law, the provenance of the original matters ("lovlig förlaga").

This means that it's not legal to download a rip of e.g. a CD that was uploaded without consent, even if you own a copy.

(This exception to the general right to make copies for private use was added in 2005 to make downloading illegal -- previously, only uploading was infringing.)

I would assume just the act of downloading this content was illegal in the relevant US jurisdictions as well.

wil421 537 days ago

I believe the most famous cases in the US have only gone after the people sharing or seeding or uploading content. My ISP couldn't care less what I download from Usenet, but they will definitely care when I start seeding.

Terr_ 533 days ago

But they are making unauthorized copies: their training data set is analogous to a private collection of duplicates.

What do you think copyright law(suits) would do if a regular person made copies of every book and movie and song they saw, placing the duplicate media in a room of their house?

unraveller 536 days ago

Trained on doesn't mean significant inclusion in the final state.

Is it truly a violation of copyright when a user hacks out bits and pieces of easily restyled raw data points from a model to look samey? What about if it takes two models? Might be time to accept humans are just cooked in their ability to discern attempts at direct plagiarism, just as it is hard to discern Sky voice from Her voice.