| HN Mirror



	by saulpw 497 days ago
	> So the models are legitimately not viable without massive copyright infringement.

	Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies then may be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.

6 comments

dkjaudyeqooe 497 days ago

> simply training a model on illegally distributed text should not be copyright infringement

You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).

One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.

saulpw 497 days ago

If that mechanical process is not reversible, then it's not a copyright violation. For instance, I can compute the SHA256 hashes for every book in existence and distribute the resulting table of (ISBN, SHA256) and that is not a copyright violation.
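
A minimal sketch of building such an (ISBN, SHA256) table, using Python's standard `hashlib`. The `books` mapping, its placeholder keys, and the `hash_table` helper are made up for illustration; real use would read each book's full text from disk.

```python
import hashlib

# Hypothetical stand-ins: placeholder identifiers and short byte strings
# in place of real ISBNs and full book texts.
books = {
    "ISBN-A": b"Full text of some book would go here.",
    "ISBN-B": b"Full text of another book would go here.",
}

def hash_table(books):
    """Map each identifier to the SHA-256 hex digest of the book's bytes."""
    return {isbn: hashlib.sha256(text).hexdigest() for isbn, text in books.items()}

table = hash_table(books)
for isbn, digest in table.items():
    print(isbn, digest)
```

Each digest is 64 hex characters (256 bits) no matter how long the book is, so the table cannot hold the text itself.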

dkjaudyeqooe 497 days ago

That's actually within the other fair use factors. So your hash table is fair use because it's transformative and doesn't substitute for the original work.

I edited my post to make it a bit clearer.

anticensor 497 days ago

It's actually even less than fair use; it's non-copyright use: one-way hashes are intentionally designed to eliminate the creative element and output random-looking data.

gruez 497 days ago

>One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.

Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.

aoanevdus 497 days ago

What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?

What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?

dkjaudyeqooe 497 days ago

Anything a machine does. You can simulate whatever you like, but under the law it's not human so it's mechanical.

cycomanic 497 days ago

That's an interesting take, but false in a lot of jurisdictions. Even if we ignore the question of whether the model can distribute work, in many places even downloading content is illegal. Otherwise the person torrenting a movie would be totally in the clear. Or think about what MS would say if a company "just" downloaded copies of Windows to use on their computers without ever distributing them.

gruez 497 days ago

>Otherwise the person torrenting a movie would be totally in the clear

Any examples of people being sued for merely downloading? "Torrenting" basically always involves uploading, even if you stop immediately after completion. A better test would be if someone was sued for using an illegal streaming site, which to my knowledge has never happened.

veggieroll 497 days ago

I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.

But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically or in how copyright applies to LLMs in general).

_DeadFred_ 497 days ago

As long as someone gives me the software to run my business, that person might be in violation of copyright but I'm in the clear.

Simply running my business on illegally distributed copyrighted text/software/movie should not be copyright infringement.

layer8 497 days ago

If you buy a machine that prints copies of copyrighted books (built into the machine), and you use that machine and then distribute the resulting copies, and the machine didn't come with a license allowing you to do so, I'm pretty sure that you are liable as well.

At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.

itishappy 497 days ago

You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.

tyfon 497 days ago

> Copyright is not about acquisition, it is about publication and/or distribution.

It would be interesting to see how this holds up in court.

"Your honor, I didn't watch the movie I downloaded, I only used it to train an AI."

I highly suspect it would not matter.

johnnyanmac 497 days ago

Well, I think that will be the final judgement. We'll treat training data more as distribution than as consumption. Things always get more complicated when you put stuff up for sale. I also can't necessarily get away with making "Garry Botter", who got accepted into an Enchanter school and goes on adventures with Jon and Germione. Unless it's parody, you can only cut so close before you're just infringing anyway, despite making it legally distinct.

blibble 497 days ago

> Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all books from a torrent, they can use it to train their AI.

"a person reading" and "computer processing of data" (training) are not the same thing

MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered the verdict that loading unlicensed copyrighted material from disk was "copying", and hence copyright infringement