Hacker News
by api 825 days ago
The entire AI industry is powered by piracy at a massive scale. Very little training data is properly licensed or compensated. It's just more obvious with open models because we can investigate them. Closed models are sausages and we don't know what went in.

Download a movie and you can get sued or your Internet connection terminated, but pirate the entire collective output of humanity and sell it back to us from behind a paywall and that's fine.

I have more sympathy for Stability here because at least they opened the models. IMHO models trained on not-properly-licensed (pirated) data should at the very least not be copyrightable and should be public domain. (These piracy enterprises are aware of this as a possible legal outcome in some jurisdictions, so the whole AI safety bullshit performance is an attempt to scare people about open models to head off the potential of questionably-trained models being declared uncopyrightable and forced to be released.)

8 comments

> IMHO models trained on not-properly-licensed (pirated) data should at the very least not be copyrightable and should be public domain.

My understanding is that ML model weights cannot be copyrighted as an original creative work. They are trade secrets and protected through contracts, but once leaked to third parties it’s not a copyright violation to use/distribute them.

Whether the model is actually a derivative work of the training data is another interesting question.

Or is my theory off here?

The main argument I have seen (which is also OpenAI's in their legal briefs) is that it is fair use. The idea of "fair use" is that you are conceding that you are infringing by creating a derivative work, but it's still okay. Implied in the fair use argument is that it is a derivative work.

> Implied in the fair use argument is that it is a derivative work.

You can get all LLMs to spit out almost exact copies of known IP visuals from movies and games. For instance, with DALL-E and Midjourney, it's relatively easy to get similar pictures from film and game studios. Those are copies with minor changes. It would be hard to argue otherwise in court. The same happens with ChatGPT spitting out verbatim passages from New York Times articles.

>and sell it back to us from behind a paywall and that's fine.

That's the sticking point. If it's an open tool for humanity's benefit being created and given back to us, that's one thing... but to sell it back to us...

With that said, piracy is close to what's happening... but I think we should be careful about classifying where/what exactly the matter is. The reason is that the matter may lie down the end of a slippery slope, or it may be straight ahead of us... the future is hard to know. If we classify it poorly we may unintentionally cause human (post/trans-human) rights issues {if I upload my consciousness to a digital mind, I don't want archaic laws to dominate what I can see/compute based on the material of which I'm made}.

These companies need to make money because they took VC money and training these things takes 10s if not 100s of millions of dollars.

Also nice to see the complete nonsense of digital mind uploading on hackernews vis a vis this discussion. If that happens we'd need to change a lot of laws anyway.

The mind uploading is just one of many potential outcomes.

To me the more interesting concern: we can't seem to agree on the bare minimum requirements for sentience/experience. Maybe the 'bare minimum' is 'electricity runs through it'. It may be that these LL/SD/ML models are having an 'experience' without the proper memory/state/internal-control to achieve sentience/consciousness.

Laws need to change, that's for sure (look at copyright).

And taking VC money to do anything Open seems like a trap. There are government grants... but yeah... there exists a whole host of (related or similar) problems in that.

Thought crime in 2084: the name given for a crime for which the only evidence is a scan of your brain.

e.g. "You imagined someone naked! That's a non-consensual deepfake of intimate personal imagery!"

>The entire AI industry is powered by piracy at a massive scale.

ARRRRR..

This is still a grey area for me. It's a neural network. It works similarly to how our brains work, but more consistently. It doesn't seem like piracy to me. If an artist was really into Salvador Dali, and happened to imitate his surrealist style, it would not be considered piracy. In fact, this is how art has evolved over the centuries. Each relevant artist in the past has incrementally contributed to what we call art today.

I feel like the people unwilling to accept that AI may impact their career are more worried about putting food on the table than anything else, which is very understandable, but it's just the cost of progress.

The bigger problem we need to deal with is how to retrain and provide job placement for those who are affected by disruptive technologies. We've really failed the public on this in the past and I don't think it's worth nerfing emerging tech just to keep people employed. This is not the first or last time this has happened, and it's going to be more frequent as technology advances.

> It's a neural network. It works similarly to how our brains work, but more consistently.

Irrelevant and incorrect.

> It doesn't seem like piracy to me.

It's pretty indisputably piracy, whether or not it's legal/fair use/whatever. Many of the training sets included material like the books3 corpus which was downloaded to a server somewhere. That is simply piracy, doesn't matter why they downloaded it.

I believe many artists rightly refuse to accept this threat to their livelihoods because it was built on their labor. It's so fucking rich to see people patronizingly suggest that this is just an economic problem and those artists better just figure out a new profession.

You built a commercial product on unlicensed data. Do you actually think the law is going to agree that that's fair use?

> It's pretty indisputably piracy, whether or not it's legal/fair use/whatever.

Ah, this is obviously some strange usage of the word 'indisputably' that I wasn't previously aware of.

> I believe many artists rightly refuse to accept this threat to their livelihoods because it was built on their labor.

This model is trained from scratch using only public domain/CC0 and copyright images with specific permission for use: https://huggingface.co/Mitsua/mitsua-diffusion-one

Does it change anything?

If all the other models were deleted, and this was the only one left, and all future models also had to be similarly licensed, would it change even one single point?

Even if it was the only remaining model and this kind of licensing a requirement for all future work, artists would still be automated out of their highly skilled yet poorly paid profession. It still sucks. There's still no nice way to convey that.

> You built a commercial product on unlicensed data. Do you actually think the law is going to agree that that's fair use?

What do you think the Google search engine is, if not a commercial product built on unlicensed data?

The courts go both ways on this specific question with Google depending on the exact details, because nothing in law is as easy or simple as the clear-cut, goodies-vs.-baddies, black-and-white morality play you want this to be.

The fact that Stability AI have not yet been sued out of existence in a simple open-and-shut court case about copyright infringement ought to have demonstrated both this point, and also that the question "is this piracy?" is, in fact, disputable.

https://huggingface.co/datasets/P1ayer-1/books-3/discussions...

It seems incredible to me to suggest that piracy wasn't involved in the collection of training data, regardless of your view on the morality or legality of it. Datasets like books 3 indisputably contained copyrighted content that was being distributed without permission from the rightsholder. That's just the definition of piracy. If we can't agree on that then I'm not sure what we're doing here.

More materially to this discussion, yes, it would absolutely make a difference if the AI was only trained on licensed content. I wouldn't use it but I wouldn't have a problem with it. The issue is specifically that much of the work being used without permission is being used to replace the people who made that work, and is being used without permission. If the model is based on ethically acquired data, it would be less able to reproduce the style of specific artists. Imo, there would be more room for both kinds of art in this case.

I'm also aware that it's not a clear-cut case legally, but I think AI advocates and tech enthusiasts believe AI is a lot more likely to win in court than it actually is. Napster took years to litigate and was eventually shut down. There's a really good discussion about this on the Decoder podcast between actual lawyers.

> https://huggingface.co/datasets/P1ayer-1/books-3/discussions...

https://transparencyreport.google.com/copyright/overview?hl=...

> It seems incredible to me to suggest that piracy wasn't involved in the collection of training data, regardless of your view on the morality or legality of it. Datasets like books 3 indisputably contained copyrighted content that was being distributed without permission from the rightsholder.

Is the Google search engine piracy?

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com....

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

https://9to5google.com/2016/04/27/getty-images-google-piracy...

https://www.reuters.com/article/idUSN07281154/

> That's just the definition of piracy. If we can't agree on that then I'm not sure what we're doing here.

It literally isn't the definition of piracy.

Piracy exists only with regard to the legal definition: "Copyright infringement (at times referred to as piracy) is the use of works protected by copyright without permission for a usage where such permission is required, thereby infringing certain exclusive rights granted to the copyright holder, such as the right to reproduce, distribute, display or perform the protected work, or to make derivative works."

Even this definition annoys a lot of people, but I will ignore the whole "it's not theft because you're not depriving the original owner of anything" as a case of taking an analogy too literally.

> More materially to this discussion, yes, it would absolutely make a difference if the AI was only trained on licensed content. I wouldn't use it but I wouldn't have a problem with it. The issue is specifically that much of the work being used without permission is being used to replace the people who made that work, and is being used without permission. If the model is based on ethically acquired data, it would be less able to reproduce the style of specific artists. Imo, there would be more room for both kinds of art in this case.

Congratulations on being consistent, almost all the artists and authors are still permanently out of work.

Even ignoring that style isn't covered by copyright (because you could reasonably argue instead that it's a trademark and/or design right issue), most artists are already extremely poor due to oversupply by other humans.

> I'm also aware that it's not a clear-cut case legally, but I think AI advocates and tech enthusiasts believe AI is a lot more likely to win in court than it actually is. Napster took years to litigate and was eventually shut down. There's a really good discussion about this on the Decoder podcast between actual lawyers.

FWIW, I know better than to trust my own beliefs[0] about law, as (free) ChatGPT is simultaneously bad, and yet vastly better at it than me.

Likewise, I think (but hold the view weakly) the mere existence of AI at even the level it was before ChatGPT's first release, is going to force a radical change in the nature of IP laws — even then these models were too good-and-cheap for countries to not allow them, while also breaking a lot of the current assumptions about everything: https://benwheatley.github.io/blog/2022/10/09-19.33.04.html

[0] I really ought to get a T-shirt printed with "Wittgenstein was wrong!"; there are so many different ways I don't accept one of his famous quotes: https://philosophy.stackexchange.com/questions/72280/first-p...

> The entire AI industry is powered by piracy at a massive scale.

Forget about AI. Instead it is almost the entire art industry, wholesale!

The semi-professional online art commissioning market is almost entirely copyright-infringing fan art, sold without the permission of the IP owner.

Yes, fan art is infringing. Especially when it is sold. And if you go to a convention center, to the artists section, you will see that over half of the booths are straight up selling other people's IP without permission.

This is the case for conventions, online art commissions, etsy/handmade items, all of it.

It's all illegal, all infringing, and the only reason why anyone cares now is because someone else can do the same thing that others have been doing for decades, but quicker and cheaper.

I'm glad someone brought this up. Artists, especially fan artists, will only hurt themselves if they advocate for classifying transformative works as infringing. Fan art has been so normalised that people have forgotten that it used to be considered legally dubious. Better to advocate for reskilling and social safety nets; automation affects everyone, not just them.

Fair use has 4 factors. Transformative is one of them. Recently, courts have gotten much more interested in a different factor, the "commercial intent" factor. While fan art is less transformative than AI training, it's not commercial and it's not competing with the original work (if anything, it enhances the market for the original work). Generative AI models are both commercial products and very successful competitors with the original works they used for training.

> it's not commercial and it's not competing with the original work

Yes it is and yes it does.

"Fan art" is "fan" in name only.

If you read back on my original post, you will see that I am talking about almost the entire online professional art commissions market.

From online, to convention centers, and more.

All of this is commercial and all of this competes with the IP owners.

People just sell other people's IP in all of these places.

I'm perfectly fine with getting rid of AI... if all fan artists paid the statutory and actual damages for their infringing activities.

What actual damages are there for fan art? Can you prove that work done by a fan artist would otherwise have been done by the original creator and that fan artists are costing sales of the original works they create fan art for?

> Very little training data is properly licensed or compensated.

Could it ever be the case, I wonder, if we could trust/enforce/believe that a model had so abstracted what it learned from the training inputs such that the model was not a derived work from them?

I've seen the examples where the model is able to reproduce recognizable characters from popular media. Those look like they might be "just" overfitting? I can see that as desirable from the point of view of being able to create a picture of "Robocop shopping for diapers". But maybe we could compromise and converge to a point where AI art isn't quite so demonized and instead is seen as a useful tool.

I think it's obviously problematic that these companies are deriving value from millions of people without compensating them, while creating a product that competes with those masses.

You are describing the original meaning of "cultural appropriation", like when jazz and rock & roll were copied from Black American culture and sold.

I am describing "copyright infringement"

If you are selling something, and no one is buying it, the value you have generated is zero. If you put something online and you did not bother to understand this material can potentially be used by a third-party on account of its loose licensing, then who's to blame?

But the licensing isn't loose in many (most?) cases we're aware of. Merely making an image publicly available online doesn't give the viewers rights to do whatever they want with it under our copyright laws.

Well, I suppose the keyword here is "most?" because the burden of proof lies with the prosecution. The legal gymnastics of coming up with a reasonable argument for this will be interesting.

They could at least make an effort to purchase licenses from all non-open content, comply with open licenses, and exclude content otherwise. They aren't making anything more than the most lame token effort because they don't care.

But that's just it - if we believe that what the model learns from the training material is abstract enough, they shouldn't license the content at all. Humans learn from and are inspired by art all the time. They create new works that are not considered derived works, despite there being obvious influence. Could we conceive of the same circumstance being possible with machine learning?

If we go down this road right now, we are allowing superintelligent AI powered corporations to front-run the entire human race and sell everything we think back to us.

It's not about theory of mind stuff. It's about just compensation of living human beings.

Well, with the status quo, there's no license required to train on the greatest works of art from centuries past.

I recognize some of the concerns about AI but I don't think pinning hopes on copyright law will deliver anything remotely resembling a remedy to the problems you bring up.

Are you talking about training a human or training an artist?

Downloading copyrighted data at huge scales to use in your commercial software product is pretty substantially different than an art student studying a reference.

1848 had a publication that might interest you.

I think the manifesto is missing some important aspects about game theory and human nature, and for some of that theory of mind is indeed very important, and that's why this particular political experiment didn't work out in the end despite the good intentions and that several aspects have become globally accepted.

I’ve read it but I think this case is much more unambiguous. Workers are paid; Marx would argue they are systematically underpaid and disempowered.

In this case the workers are not paid at all. Their work is not even acknowledged. It’s closer to cultural appropriation but quite a bit more unambiguous than that as well since this isn’t people learning from people. This is mass uncompensated value harvesting.

The number of hands benefiting here is incredibly tiny. In theory you could have one human owning the entire human mind and renting it back. This is the danger of present generation AI, not Skynet scenarios, and if anything the sci-fi stuff distracts us from this.

It’s like an information theory equivalent of today’s shoplifting epidemic except there are tiny gangs of only a few shoplifters able to run at Mach 10 and shoplift from every store in the country in days.

> The entire AI industry is powered by piracy

Just like all art. When you draw something you don't cite every single thing you've seen and experienced in life that inspired your drawing and style. Nor did you own or pay royalties to all that inspiration either.

>When you draw something you don't cite every single thing you've seen and experienced in life that inspired your drawing and style.

Oh please. There is an astounding degree of nuance and context missed in your example here.

I'm kind of down to let people follow this dead end reasoning since it's legally irrelevant. Makes it much easier to disregard.

> questionably-trained models being declared uncopyrightable and forced to be released

I think uncopyrightable is a likely outcome, but where are you coming up with forced to be released?

If there is a model which is, for the sake of the argument, absolutely definitely and unambiguously powered by piracy at a massive scale, then the act of forcing it to be released is going to necessarily entail all of the possible harms from that specific act of IP piracy.

IMO, if a model is deemed to be such, all copies of that model should be destroyed. Actual copyright law allows for the destruction of equipment used for copyright infringement, and those laws were written in the days where this meant "a printing press".

> the whole AI safety bullshit performance

The people who care about AI safety have been loudly warning about it for so much longer than these companies and models have existed, that they roll their eyes at newspapers using stock photos from Terminator to illustrate the discussion.

> The entire AI industry

Also includes self-driving cars, spam filters, medical diagnosis tools, …