by camillomiller 672 days ago
But it IS a moral question. As it is IMMORAL to just go and steal content to train AI and then let the lawyers take care of it later.

AI is not "stealing content" and it is not immoral.

This is literally the same line from record companies and Hollywood during the 80s and 90s, and they were rightfully mocked then. It is baffling how so many people are just repeating it now.

I've noticed the same. I think the actual reason behind it is similar in both cases, something like "this is theft because it means I won't get paid", which isn't really how that works but also explains Musk's X advertising lawsuit.

Like your respondent, I find the artists more sympathetic despite finding the argument itself bad.

I don't know if "theft because I'm not getting paid" describes artists' sentiment so much as "theft because someone else is making money on a tool that's 'using' my work/labor". Which I still think is an argument that requires opening a Pandora's box of IP law that should not be opened (because in the long run it won't benefit non-corporate artists anyway), but...

In general, I think the coming replacement of many jobs is a serious issue for society and one that deserves serious attention.

That said, the intellectually dishonest arguments irk me to no end and I'm simply tired of them. The artists upset over AI are more sympathetic than the RIAA (it's really quite hard not to be), but this stuff really wears at my sympathy.

And, unfortunately, it's impossible to actually tackle any issues without moving past the bad arguments.

> That said, the intellectually dishonest arguments irk me to no end and I'm simply tired of them.

I, too, get annoyed by People Who Are Wrong On The Internet; I'm trying to become more stoic, as I don't want to end up like my father.

I don't know if this will pass, or if the pro/anti AI split is going to be as permanent as the economic left/right split, or the libertarian/authoritarian split.

> This is literally the same line from record companies and Hollywood during the 80s and 90s, and they were rightfully mocked then. It is baffling how so many people are just repeating it now.

The difference with AI is that AI takes a 100% bit-for-bit copy and uses it, whereas humans just use their impressions as inspiration.

Yes, illustrators are notorious IP rentiers like the Hollywood studios and the RIAA. It’s the tech billionaires that are the victims of their vile, unjust monopoly tactics. These are coherent thoughts that demonstrate why it’s a good idea to argue from analogies.

I'm not praising tech billionaires, nor am I attacking RIAA/Hollywood or online artists as entities. Please don't start crafting strawmen. I'm criticizing the "stealing" argument because I don't find it logically sound; it doesn't matter who's saying it.

I am still more than willing to have a civil debate around the argument itself.

Why is it stealing to analyze images? I would be more convinced if AI used a fixed database during generation, or if it was considered a standard, acceptable practice to reproduce training data as "new" generations.

You don’t find the stealing argument logically sound because you immediately frame the theft as “analyzing” to suit your own narrative and then demand people engage with it, while proceeding to make further spurious claims like…

> I would be more convinced if AI used a fixed database during generation

Wow, I didn’t know that model weights, an elaborately compressed form of their training data, rewrote themselves every time they were invoked. Or that it’s only theft if I stole data from a fixed database to build my own service.

AI training is literally analyzing. That is how it works. Properly trained models (i.e., ones that aren't overparameterized or overfit) do not just "elaborately compress" training data as this is not possible. For example, you cannot compress 1 billion images into 1 billion parameters, and expect to retrieve them later.

If objective facts are "my own narrative", then no rational discussion can occur.
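
For what it's worth, the arithmetic behind the "1 billion images into 1 billion parameters" example can be sketched like this. It is a back-of-the-envelope illustration only: the 32-bit parameter size is an assumption (no precision was specified above), and `bytes_per_image` is a hypothetical helper name, not anything from a real library.

```python
def bytes_per_image(num_params: int, bits_per_param: int, num_images: int) -> float:
    """Average parameter storage available per training image, in bytes."""
    total_bytes = num_params * bits_per_param / 8  # size of all weights in bytes
    return total_bytes / num_images

# 1 billion parameters at an assumed 32 bits each, 1 billion training images.
print(bytes_per_image(10**9, 32, 10**9))  # 4.0
```

Four bytes is about one uncompressed RGBA pixel at 8 bits per channel. Note that this is only an average over the whole training set; it says nothing about how that capacity is spread across individual images.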

Oh well, you should tell the folks at DeepMind and Meta about these objective facts then so they don’t waste any more time doing research:

https://arxiv.org/html/2309.10668v2

Maybe apply for a job there too, since you’re obviously so far ahead of everyone in understanding this problem space.

You absolutely can compress a subset of a billion images into a billion parameters if you throw out all but a thousand. Is it no longer copyright infringement if you also run enough irrelevant data through your algorithm alongside the images you’re stealing?

Don’t mind me, I’m just going to ‘analyse’ this UHD movie and produce a 480p video file in a different codec whose bits are almost entirely unlike those in the original and throws out almost all the information from the original. I’ll put it on a RAID array with thousands of others, mangling the bits of the ‘analysis’ even further. The right ‘prompt’ may cause the model to produce some imagery very similar to some of its ‘training data’ however.

You can use whatever weasel words you want, but bits go in and fewer derivative bits come out in both cases.

This is a strawman.

The purpose of video codecs is to reproduce the original video. If you do that, it's copyright infringement.

AI models should not reproduce the original images. The output will not be something that already exists.

Purpose and intent matter.

You’re right, purpose and intent matter, and the intent is to profit from the work of others without their permission and without crediting or compensating them in any way.

It has to do with what the resulting model is used for. It gets particularly dodgy if it’s commercial usage, because most if not all of the data used for training wasn’t licensed for that, making for a “laundering” effect.

Though I also think there’s an argument to be made that images need to be properly licensed to even be “analyzed” in this way, because it’s ultimately an unauthorized copy even if it involves picking the image apart and obfuscation. They were published with the intent of being viewed by the public, not for being reproduced in any shape or form.

It’s not baffling at all if you consider the point of copyright in the first place. That is, to promote the progress of useful arts and sciences. One does, and one does not.

Yup. Generative AI is the useful part of "useful arts and sciences". 99%+ of content that goes into training those models is, on its own, useless and worthless, and the greatest value by far it can bring to society is to be part of the training dataset. That also applies to art that may have had some value when published, but now languishes in obscurity - AI is giving it a second life, a way to benefit society far more than it originally did.

So yeah, if we're going by the (idealized version of) intent of copyright, it stands strongly on the side of AI.

EDIT:

And before someone complains that SOTA models are trained and owned by private parties -- copyright is "promoting the progress of useful arts and sciences" by literally giving private parties a monopoly to make money off art as an incentive.