by EMIRELADERO 416 days ago
Then the results would be the same, and it would still be fair use. I have yet to see an example that demonstrates LLMs plagiarize by default or by tendency.

Your causality seems to be inverted here. You seem to be implying that "learning" (or the ingestion and retention of information to the same end) is banned by default for everything, but we decide to allow it for humans as the sole exception. This is not the case. Everything not prohibited is allowed, and "intermediate copies" are considered to be vital to fair use by the court system.


No, it wouldn't. Because if I record "Revenge of the Sith", compress it, and then distribute it for free online, that's obviously not fair use.

Fair use is pretty complicated. Part of fair use is "The Effect of the Use on the Potential Market for or Value of the Work", which already puts even human commercial endeavors in a tough spot. You can make it work, but you have to really try. Satire like Weird Al or whatever isn't competing with the music it's satirizing; the Venn diagram of those markets barely overlaps. But a lot of LLM use cases are explicitly meant to obsolesce and siphon value from the things they used.

Like, why go to Getty Images when you could instead go to the glorified database, which has ingested all of Getty Images, and acquire an indistinguishable stock photo for free?

The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

> The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

But unless your argument is that the photo outputs from the GenAI are literally equivalent to the training data, you would agree the end result is the same, right? Anyone can see that the images are not the training data stitched together, so it doesn't even really matter how it all works mechanistically, even though your description ("glorified database") is wrong.

Part of my point is that you don't need to produce literally equivalent output. Again, if I record and compress "Revenge of the Sith", there are literally zero pixels shared between my recording and the actual movie. Cool, so I can go upload it for free then, right? No, I can't.

Can GenAI produce images indistinguishable from what's on Getty Images? If you write the prompt correctly, yes. I know because there are services where you can get generated stock images.

> Part of my point is that you don't need to produce literally equivalent output. Again, if I record and compress "Revenge of the Sith", there are literally zero pixels shared between my recording and the actual movie. Cool, so I can go upload it for free then, right? No, I can't.

That's because you would be redistributing the actual material, just in a really roundabout way. GenAI models are not that, they're not a database and don't work like one.

> Can GenAI produce images indistinguishable from what's on Getty Images?

That doesn't matter because you can't copyright a style. From the point of view of copyright law, it would look like you were copying nothing proprietary/owned at all.

> That's because you would be redistributing the actual material, just in a really roundabout way.

Right, which I’m arguing is what LLMs do just in an even more roundabout way.

The technical details of LLMs don’t actually matter. We don’t really care if they’re a database or not. The question is: do they reproduce the source material? And yeah, pretty much they do, in a lot of instances. Not all, but a lot.

To produce yet another analogy, imagine I have a service X. You can pay and I will give you any movie you want. You don’t know how I do it. Is this copyright infringement or not? I would say yes. Now let’s say I reveal the secret - I open up Photoshop and painstakingly recreate the movie frame by frame. I might make a mistake here or there. Is this still copyright infringement? I think it is.

> To produce yet another analogy, imagine I have a service X. You can pay and I will give you any movie you want. You don’t know how I do it. Is this copyright infringement or not? I would say yes. Now let’s say I reveal the secret - I open up Photoshop and painstakingly recreate the movie frame by frame. I might make a mistake here or there. Is this still copyright infringement? I think it is.

Okay, but that is not what's happening here. Demonstrably so. The fact that a model is technically capable of overfitting to certain heavily repeated points in the training data doesn't mean the entire thing has to be shot down. The non-infringing uses far outweigh the offending ones, by a lot.

If what you say is true, and they do outright copy a lot, then it should be pretty easy for any IP holder to sue anyone who misuses the model that way for copyright infringement on those specific outputs.