| HN Mirror



	by davexunit 715 days ago
	It's so obvious to me that machine learning models are derivative works of their training set. If they weren't, then why would these companies fight so hard to say otherwise? They need that training data to make their product, so they should pay the licensing fees for it! 10 years ago, when I worked on a machine learning model for my employer, it was unthinkable to train on data we did not have the rights to use. But now it's all fair game because OpenAI executives would make a little less money otherwise? They certainly aren't giving up any of their own copyright in return. It's a very transparent transfer of power and money from regular people to the bosses.

4 comments

doctorpangloss 715 days ago

> It's so obvious to me that machine learning models are derivative works of their training set.

Okay, but narrative creators watch movies and listen to music and read books too. Many do indeed "file the serial numbers off" other people's work and publish something else that makes money for them and not the original creators. Does one instance of "filing the serial numbers off" by one author mean that no authors anywhere are allowed to write any books as soon as they've read "a bunch" of other books? I get what you are saying, but it's not so obvious what the right policy is. It is very hard to make it consistent when "AI" is substituted with "human," and it's not so obvious if "AI" is a distinct class from human, because it is, after all, something that only exists because a programmer somewhere wrote and operated it.

johnnyanmac 710 days ago

>Does one instance of "filing the serial numbers off" by one author mean that no authors anywhere are allowed to write any books as soon as they've read "a bunch" of other books

So far, pretty much all major actors are doing it. So yes, if everyone is abusing a rule, the ball is taken home.

>I get what you are saying, but it's not so obvious what the right policy is.

The one these companies spent 2 decades prior fighting to strengthen. Yes, I am enjoying the schadenfreude of watching companies get caught in the same copyright lingo they used to launch thousands of lawsuits, now that they are on the other end and it is convenient for them to "steal IP".

Copyright needs a major rework, but never interrupt your enemy in the middle of a mistake.

dmonitor 714 days ago

> it's not so obvious if "AI" is a distinct class from human

It is. It obviously is. It's the same reason that a person watching a movie and remembering it later is different than recording the movie with a camcorder.

> Ah but I made a robot that walks into theaters, buys a ticket, records the movie, leaves, and then recreates the movie at my home an infinite number of times. I didn't break the law, since a human could surely do the same thing with enough practice and effort.

Do you realize how ridiculous that sounds?

doctorpangloss 714 days ago

The policy isn’t written that way though. The policy doesn’t say anything about camcorders. So you’re right about camcorders. But the law says “copying,” which is pretty abstract, and the case law is really detailed, so it’s not so black and white. Nobody cares about your imaginary situations with robots - I basically agree with you that there needs to be a distinct law governing AI training, and that leads to a far more interesting and totally normative conversation about who, if anyone, are the good or the bad guys.

If the policy (via case law) becomes “expressly permissioned content only,” there are no image generators. Some people may want that. But is that better than where we are in the current status quo, where we have them? I don’t think so.

Covenant0028 714 days ago

There is a difference, and AI companies understand it very well. All of them prohibit you from using their model to train other AI models. Microsoft takes it a step further and even prohibits you from trying to discover how the models work.

No human, however powerful, can prevent you from looking at their actions and learning from them. You can look at Obama's speeches for instance and learn how to craft certain messages for your own speeches. Nothing he can do to stop you from doing that.

And that is the key difference: AI models have been designed to privatize the process of learning, wherein they have unlimited freedom to learn from any human's work without compensating them for it, but humans or even other AI models cannot learn from an AI model.

This distinction IMO removes any right that the AI companies have to pretend that their models are people. They're not, the actions of the AI companies themselves show that.

slavik81 714 days ago

> All of them prohibit you from using their model to train other AI models.

Have they ever successfully enforced this clause in court? An equally valid resolution would be a conclusion that they don't actually have that power.

Retric 715 days ago

The issue here is that the AI model itself is a derivative work.

Further, they will very much recreate things they’ve seen many examples of. Recreating “Mona Lisa” isn’t a problem, but recreating “Iron Man” is. Individual artists may not know how to prompt the system to recreate their work, but looking at the training sets is going to help quite a bit.

doctorpangloss 715 days ago

No, the issue is that it makes outputs that compete with artists, and that is a problem if you go and make a fair use argument for appropriating copyrighted works.

If I were to secretly use an image generator, just for my own purposes, trained on public data, the plaintiffs would say it is just as illegal.

The rub is, do you know who else makes work that competes with artists? Other artists! It still kind of goes down on some vibesy stuff that I don't know if the law has a straight answer to. And for what it's worth, the Andy Warhol v. Goldsmith decision was about artists competing with other artists - this is the decision that has created an opening to challenge fair use. I just wonder why limit ourselves to the peculiarities of that case, why not open all forms of competition between artists to litigation over their influences and processes?

Retric 715 days ago

How the model is used isn’t relevant if creating it was already infringement. Training on works creates something of value and artists want to be able to prevent that training without compensation. There’s a long history of case law around just how much of someone’s work can be copied before it’s a problem. But here it’s literally the entire work being used so ‘how much’ is just everything.

The points you bring up are also relevant but artists don’t want to look through a billion individual images to see if that specific image happens to infringe on their work.

Edit: Wrote the response to a comment that got deleted before I posted presumably because I edited this one: IMO many commentators are getting this wrong.

“the less likely it is that the appropriation will serve as a substitute for the original work or its plausible derivatives, shrinking the market opportunities for the copyrighted work” https://www.supremecourt.gov/opinions/22pdf/21-869_87ad.pdf

The form of these models is very different, but the purpose is to create directly competing works. Each individual output may not directly infringe on a specific work, but the goal of the model very much is.

The comment brought up commentary about: https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...

Workaccount2 714 days ago

It's only clear that training is a violation of copyright if you have a layman's understanding of how training works. There are no images stored in image models, just vectors that represent pixel relationships. You may call this fancy compression, but the ship runs aground if you try to "compress" a small set of images with a transformer - you will just get random noisy junk on the output.

Artists have a much firmer legal ground to stand on if they go after model output, but the goal is to kill image generators, not simply censor their output.

Think of it like this: If I splatter paint on a canvas, does jackson pollock have a copyright claim? Probably not, despite my creation being a product of training on his work. But it would be fair for my creation to be checked to see if it is too similar to one of his works.

Retric 714 days ago

> just vectors that represent pixel relationships

Ask DALL-E 2 for Mona Lisa and it will produce something clearly derived from the original work. The ability to recreate items from the training set depends on how these systems are trained, but they are clearly capable of retaining enough to be problematic.

The Harry Potter movies aren’t the original books; a derivative work doesn’t have to be the same as the original, just directly derived from it.

> If I splatter paint on a canvas, does jackson pollock have a copyright claim?

If you’re trying to copy him then actually yes he would. Being inspired by a technique is fine, but the difference is less subtle than you might think.

Copyright cares how something was created. If you end up with ‘random’ patterns that happen to look suspiciously similar to another work, it’s extremely unlikely that you came to that point randomly. What are the odds you would pick the same 12 colors as someone else and apply them in the same order? 12 factorial isn’t a small number, and that’s before considering the color selection.
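
As a rough check on the arithmetic in the comment above, here is a minimal sketch in Python. It assumes the 12 colors are distinct and that every application order is equally likely; both are simplifications made for illustration, not something the comment states. The function names are hypothetical.

```python
import math


def ordering_count(n_colors: int) -> int:
    """Number of distinct orders in which n distinct colors can be applied (n!)."""
    return math.factorial(n_colors)


def chance_of_same_order(n_colors: int) -> float:
    """Chance that a uniformly random order matches one specific order.

    Assumes all orderings are equally likely, which is an illustrative
    simplification rather than a model of how anyone actually paints.
    """
    return 1 / ordering_count(n_colors)


print(ordering_count(12))        # 479001600
print(chance_of_same_order(12))  # roughly two in a billion
```

Under those assumptions, matching the order alone by chance is about a one in 479 million event, before accounting for picking the same 12 colors in the first place.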

paulddraper 715 days ago

Of course they are derivative.

The question is whether they are transformative.

Right or wrong, the bar for transformative use is probably lower than you think.

Artists are the beneficiaries of this, as they can riff on popular works for inspiration, recognizability, social commentary.

Given the existing case law, I don't see a ruling against AI companies as likely.

doctorpangloss 715 days ago

> Given the existing case law, I don't see a ruling against AI companies as likely.

Huh? Every corporate IP lawyer seems to think Andy Warhol Foundation v. Goldsmith has foreclosed the fair use defense, and that there isn't much for AI companies to argue in order to use work without express permission for training.

paulddraper 714 days ago

The use of the artwork was a TIME magazine cover, which has the same commercial purpose as the original photo owned by Goldsmith.

It's far from clear that case would apply here.

jncfhnb 715 days ago

> If they weren't, then why would these companies fight so hard to say otherwise?

What kind of looney logic is this?

paulddraper 714 days ago

IDK but it is wild.

s1artibartfast 715 days ago

Needing the training data has zero bearing on whether they are derivative works. "Derivative works" is a term of art with a specific meaning.

I think the derivative work argument is a dead end. However, AI companies did violate use licenses when they first used the data for the commercial purpose of training the models.