Hacker News
by CamperBob2 454 days ago
(Shrug) We'll see what the courts say, Gary.

If training AI doesn't constitute fair use, you will lose more than you could ever possibly hope to gain. As will the rest of us.

Meanwhile, sublimate your dudgeon towards advocating for free access to the resulting models. That's what's important. Meta is not the company you want to go after here, since they released the resulting model weights.

To point out the obvious.

Unauthorized copying (aka pirating) is definitely a copyright violation.

That appears to be a huge problem with the large models and training. They don't secure legal access to the materials they train on, and thus fail to compensate authors for their work.

In other words, students are required to buy or otherwise obtain legal access to their textbooks (for example, by checking the book out of the library).

AI training should play by the same rules human students have to follow.

Obtaining copies of pirated works is not infringement. Unauthorized sharing is infringement but being on the receiving end of sharing is not (even if one is an active participant).
And to point out the obvious: it seems training is not unauthorized copying. (At least, this is the current legal status quo.)
This. I am not asking for a special 1000x fee for AI. Just pay the normal fee a human would have paid, but at least pay that.
Are you also willing to work for OpenAI for free then? Have you contacted them with such an offer?
As long as I have access to the resulting model, sure. I thought I made that clear. Copyright is not as important as reaching the next stage of our intellectual evolution. Current-gen AI may not be sufficient to reach that stage, but I believe it is a necessary step.

Like the author of this screed, my work went into training every major model. I get paid back every time one of those models helps me learn or do something. The injustice, if it happens, will occur when a few well-heeled players like OpenAI succeed in locking the technology up with regulatory capture or (worse) if a few greedy, myopic assholes render it illegal or uneconomical to continue development by advocating copyright maximalism.

It's like saying that because a student reads a textbook, they now have to work for the author for free?
Does fair use imply that pirating copyrighted material is ok?

I mean, it’s a serious question; I don’t see this as really connected.

As long as an AI can “understand” the content of a book and spit out a summary of it, or even leverage what it learned to perform further inference, I’d be inclined to say that this is fair use; a human would do the same.

But this has nothing to do with using pirated material for training, especially for some kind of commercial purpose (even if Llama is free, they’re building on top of it). I don’t see why it should be legal.

Fair use is literally that:

"Fair use" in copyright law allows limited, specific uses of copyrighted material without permission.

Hence, by definition, not "pirating".

I get the commercial/legal angle, but from the viewpoint of AI being something we as a society have an interest in developing, how should this work?

Do you want to severely limit evolution of models by having them pick (and buy) a tiny subset of all books?

Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

Should Meta buy a physical or electronic copy of every book they want to use for training? That has zero impact on revenue for individual authors.

Would they be paid by word, by token, by book? This makes little sense. We don’t charge people for the knowledge they acquired while going to the library over 50 years; AI just squeezes this into weeks. Our legal framework simply doesn’t fit.

> Should every training run put money into a pool that gets paid out to every rights holder of every book that has ever been published?

That could actually work. Bearing in mind that all copyright laws are messy and terrible, this proposal is at least not impossible.

"Ever been published" means in the last 100 years.

Ok, 130 million books against $100M training costs. You charge an (unrealistic) 100% tax for the book usage. Each author will get less than a dollar. What is the point other than enriching publishing companies?
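
For what it's worth, the back-of-envelope math above can be sketched in a few lines of Python. This assumes the whole pool is split evenly per book and that each book has a single rights holder; the function name is made up for illustration, and the figures are just the ones quoted in the comment, not verified data.

```python
def per_book_payout(training_cost_usd: float, tax_rate: float, num_books: int) -> float:
    """Even split of a training-cost levy across all books (hypothetical model)."""
    pool = training_cost_usd * tax_rate  # total money collected from one training run
    return pool / num_books              # assumption: every book gets an equal share


# Figures quoted in the comment: $100M training cost, 100% levy, 130 million books.
payout = per_book_payout(100_000_000, 1.0, 130_000_000)
print(f"${payout:.2f} per book")
```

With those inputs the share comes out to roughly 77 cents per book, which matches the "less than a dollar" claim. An author with several books in the corpus would receive that amount once per book under this even-split assumption.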
You mean, what is the point beyond paying those companies that made such books possible and available for their work? No other point, actually.
> Should Meta buy a physical or electronic copy of every book they want to use for training?

Yes, and if they are training in parallel, probably multiple copies, just as multiple people need multiple books.

Multiply this by the number of GPUs and AI model providers, and the revenue impact is not zero.

Why should it be fair use? Why would being a derivative work not be OK? There is a massive corpus of public domain and FOSS works. Likewise, there are plenty of permissively licensed, government-created datasets. There is no reason a corpus built from these sources would be insufficient.
> Why would being a derivative work not be OK?

That's not even the real problem. It's a problem, yes, but not the real problem. The problem is that before they could train the model on the book, they had to copy the book from somewhere. Is it ok to make illegal pirated copies of a copyrighted book to train your model? I think that's the issue we are dealing with here.

Whether it is ok to create a derivative work or not is beside the point.

> The problem is that before they could train the model on the book, they had to copy the book from somewhere.

That, in itself, raises kind of an interesting point.

Right now there's a post on the front page where people are exercising conspicuous outrage because ChatGPT rendered a good Indiana Jones likeness in response to a vague query asking for a 1930s archaeologist with a bullwhip. Was that particular response generated by ChatGPT because it "copied" Indiana Jones? Or because it was influenced by the same pulp fiction stories and deeply-embedded cultural archetypes that led Spielberg and Lucas to create the character in the first place?