Hacker News
by bugglebeetle 672 days ago
Yes, illustrators are notorious IP rentiers like the Hollywood studios and the RIAA. It’s the tech billionaires that are the victims of their vile, unjust monopoly tactics. These are coherent thoughts that demonstrate why it’s a good idea to argue from analogies.

I'm not praising tech billionaires, nor am I attacking RIAA/Hollywood or online artists as entities. Please don't start crafting strawmen. I'm criticizing the "stealing" argument because I don't find it logically sound; it doesn't matter who's saying it.

I am still more than willing to have a civil debate around the argument itself.

Why is it stealing to analyze images? I would be more convinced if AI used a fixed database during generation, or if it was considered a standard, acceptable practice to reproduce training data as "new" generations.

You don’t find the stealing argument logically sound because you immediately frame the theft as “analyzing” to suit your own narrative and then demand people engage with it, while proceeding to make further spurious claims like…

> I would be more convinced if AI used a fixed database during generation

Wow, I didn’t know that model weights, an elaborately compressed form of their training data, rewrote themselves every time they were invoked. Or that it’s only theft if I stole data from a fixed database to build my own service.

AI training is literally analyzing. That is how it works. Properly trained models (i.e., ones that aren't overparameterized or overfit) do not just "elaborately compress" training data as this is not possible. For example, you cannot compress 1 billion images into 1 billion parameters, and expect to retrieve them later.
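
A back-of-the-envelope sketch of the capacity arithmetic behind that last claim. The 16-bit parameter width and the 100 KB average image size are assumptions chosen for illustration, not figures from this thread, and the function names are hypothetical.

```python
def bits_per_image(num_params: int, bits_per_param: int, num_images: int) -> float:
    """Average storage budget per training image if weights were pure storage."""
    return num_params * bits_per_param / num_images


def required_compression_ratio(avg_image_bytes: int, budget_bits: float) -> float:
    """How much each image would have to shrink to fit inside that budget."""
    return avg_image_bytes * 8 / budget_bits


# Assumed values (not from the thread): 16-bit weights, 100 KB per image.
budget = bits_per_image(num_params=10**9, bits_per_param=16, num_images=10**9)
ratio = required_compression_ratio(avg_image_bytes=100_000, budget_bits=budget)

print(f"budget per image: {budget:.0f} bits")
print(f"required compression ratio: {ratio:,.0f}x")
```

Under these assumptions each image gets about two bytes of capacity on average. The figure is only an average: changing `num_images` or the assumed sizes changes the budget accordingly.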

If objective facts are "my own narrative", then no rational discussion can occur.

Oh well, you should tell the folks at DeepMind and Meta about these objective facts then so they don’t waste any more time doing research:

https://arxiv.org/html/2309.10668v2

Maybe apply for a job there too, since you’re obviously so far ahead of everyone in understanding this problem space.

You absolutely can compress a subset of a billion images into a billion parameters if you throw out all but a thousand. Is it no longer copyright infringement if you also run enough irrelevant data through your algorithm alongside the images you’re stealing?

Don’t mind me, I’m just going to ‘analyse’ this UHD movie and produce a 480p video file in a different codec, one whose bits are almost entirely unlike those in the original and which throws out almost all of the original’s information. I’ll put it on a RAID array with thousands of others, mangling the bits of the ‘analysis’ even further. The right ‘prompt’ may cause the model to produce some imagery very similar to some of its ‘training data’, however.

You can use whatever weasel words you want, but bits go in and fewer derivative bits come out in both cases.

This is a strawman.

The purpose of video codecs is to reproduce the original video. If you do that, it's copyright infringement.

AI models should not reproduce the original images. The output will not be something that already exists.

Purpose and intent matter.

You’re right, purpose and intent matter, and the intent is to profit from the work of others without their permission and without crediting or compensating them in any way.

It has to do with what the resulting model is used for. It gets particularly dodgy if it’s commercial usage, because most if not all of the data used for training wasn’t licensed for that, making for a “laundering” effect.

Though I also think there’s an argument to be made that images need to be properly licensed to even be “analyzed” in this way, because it’s ultimately an unauthorized copy, even if it involves picking the image apart and obfuscating it. They were published with the intent of being viewed by the public, not of being reproduced in any shape or form.