Hacker News
by const_cast 416 days ago
No, because the entire argument hinges on the fact that LLMs learn, which is like humans learning, so it's transformative. That only works if you consider learning or transformation to be something that does not rely on the human spirit. Which, actually, most people do not believe. And it's pretty difficult to argue - we don't even know how learning works for people.

A lot of people just jump to LLMs learning like it's a foregone conclusion. Mm... no. You need to convince people of that. You'll find if you talk to non-tech people, they're not just going to believe you when you say that.

Why isn't an LLM more akin to a database or a compression algorithm? Why is it closer to human learning? After all, humans are humans and we have the exclusive right and power to determine what is human and what isn't. And databases and compression algorithms are computer programs, of the same kind as an LLM.


Then the results would be the same, and it would still be fair use. I have yet to see an example that demonstrates LLMs plagiarize by default or by tendency.

Your causality seems to be inverted here. You seem to be implying that "learning" (or the ingestion and retention of information for the same ends) is banned by default for everything, but we decide to allow it for humans as the sole exception. This is not the case. Everything not prohibited is allowed, and "intermediate copies" are considered to be vital to fair use by the court system.

No, it wouldn't. Because if I record "Revenge of the Sith", compress it, and then distribute it for free online, that's obviously not fair use.

Fair use is pretty complicated. One of the fair use factors is "The Effect of the Use on the Potential Market for or Value of the Work", which already puts even human commercial endeavors in a tough spot. You can make it work, but you have to really try. Satire like Weird Al or whatever isn't competing with the music it's satirizing; those markets barely overlap. But a lot of LLM use cases are explicitly meant to obsolesce and siphon value from the things they used.

Like, why go to Getty Images when you could instead go to the glorified database, which has ingested all of Getty Images, and acquire an indistinguishable stock photo for free?

The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

> The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

But unless your argument is that the photo outputs from the GenAI are literally equivalent to the training data, you would agree the end result is the same, right? Anyone can see that the images are not the training data stitched together, so it doesn't even really matter how it all works mechanistically, even though your description ("glorified database") is wrong.

Part of my point is that you don't need to produce literally equivalent output. Again, if I record and compress "Revenge of the Sith", there's literally zero pixels shared between my recording and the actual movie. Cool, so I can go upload it for free then right? No, I can't.

Can GenAI produce images indistinguishable from what's on Getty Images? If you write the prompt correctly, yes. I know because there are services where you can get generated stock images.

> Part of my point is that you don't need to produce literally equivalent output. Again, if I record and compress "Revenge of the Sith", there's literally zero pixels shared between my recording and the actual movie. Cool, so I can go upload it for free then right? No, I can't.

That's because you would be redistributing the actual material, just in a really roundabout way. GenAI models are not that, they're not a database and don't work like one.

> Can GenAI produce images indistinguishable from what's on Getty Images?

That doesn't matter because you can't copyright a style. From the point of view of copyright law, it would look like you were copying nothing proprietary/owned at all.

> That's because you would be redistributing the actual material, just in a really roundabout way.

Right, which I’m arguing is what LLMs do just in an even more roundabout way.

The technical details of LLMs don’t actually matter. We don’t really care if they’re a database or not. The question is do they reproduce the source material? And yeah, pretty much they do, in a lot of instances. Not all, but a lot.

To produce yet another analogy, imagine I have a service X. You can pay and I will give you any movie you want. You don’t know how I do it. Is this copyright infringement or not? I would say yes. Now let’s say I reveal the secret - I open up Photoshop and painstakingly recreate the movie frame by frame. I might make a mistake here or there. Is this still copyright infringement? I think it is.

> That only works if you consider learning or transformation to be something that does not rely on the human spirit.

Even changes made using simple non-ML algorithms can be transformative according to fair use doctrine, like the thumbnailing of images done by search engines. It's not meant in some spiritual sense.

The reason that’s okay is because you aren’t competing against the initial source material. A thumbnail on Google for “Revenge of the Sith” is not a replacement for watching the movie.

But a lot of AI products are specifically and explicitly designed to obsolesce the thing they trained on. No need to go to Encyclopedia X or the NYT, this has the same content.

> The reason that’s okay is because you aren’t competing against the initial source material.

I don't mean to claim that search engine image thumbnailing is like-to-like in every consideration, just that it demonstrates there's no "human spirit" required in order to qualify as "transformative" as far as fair use is concerned. Search engine image thumbnailing has been found to be transformative, for instance in Perfect 10, Inc. v. Amazon.com, Inc.: "Google's use of thumbnails is highly transformative."

And, though I'm probably being pedantic here, I think it's important to distinguish that the other fair use factor you allude to is not whether you're "competing against" the original work, but specifically the effect of your use on the market for, or value of, that original work. For example, if your documentary uses a clip from a TV show and also happens to air in the same time-slot as that TV show, the extent to which you compete with or displace the market for the TV show in general (as you would even had you not included the clip) is not what's under consideration, but rather only the additional extent to which you displace its market specifically due to inclusion of that clip.

Because of that, I'd claim that some machine-learning-based tool that partially displaces the market for a work it was trained on (for instance, Google Translate displacing the market for a translated version of a book) might still be seen reasonably favorably under the market impact factor, so long as the extent to which it displaces that work is largely independent of whether it has trained on that work specifically (such as if the translation tool could already provide a decent translation of the original book even before having trained on its translated version).