by CuriouslyC 816 days ago
If ChatGPT reproduces copyrighted content verbatim in a way that isn't defensible as fair use, OpenAI can be sued. Of course, that would be just one instance of copyright violation, which isn't that big of a deal, so in order for the rights holders to make it sting they'd have to prove that a very large number of people were prompting in this very specific way with the intent of piracy.

On the flip side, it wouldn't be hard to put guardrails on ChatGPT output so that if too large a percentage of an answer is verbatim, it's blocked.

4 comments

> If ChatGPT verbatim reproduces

Copyright covers "derivative works." Verbatim is absolutely not a requirement for infringement.

If you take a copyrighted image and modify it, even to the point where it's unrecognizable, if the image is being used in the same way (i.e., isn't a "transformative use"), then it's still a derivative work.

Yes, you are likely to get away with it if you're not caught. But that doesn't mean what you're doing is considered fair use, just that you won't get sued.

Thing is, every piece of text generated by ChatGPT is incrementally using every character of training data. So legally speaking, everything it produces is arguably a derivative work of ALL of the training data.

Generative AI isn't even a legal gray area; under current law, there's no blanket exception for "how much" of a copyrighted work is used. At best there's a fair use _guideline_ that lists, as one of four factors, the amount and substantiality of the portion of the copyrighted work used. But really it's the entirety of millions of copyrighted works being used to generate the models, and those works _can_ be reproduced verbatim in many cases, proving that the works are encoded into the model.

Generative AI is only permitted because there's big money behind it along with associated lobbyists. And there are many in-flight lawsuits trying to shut down both GPT and various art-generating AIs.

Maybe they'll change the law. Maybe courts will side with the AI companies. But until then, it seems obvious to me that anyone arguing that generative AI based on models built with copyrighted works is completely legal is using motivated reasoning.

I understand OpenAI is a US company, but this is a US-centric view, especially since TFA is about a Brazilian operation.

> under current law, there's no blanket exception for "how much" of a copyrighted work is used

Under fair dealing laws, there is. [1] Though, as always, if commercial fan art is legal, then so should be something that uses only a couple of bytes of information per work, barring overfits.

> But until then, it seems obvious to me that anyone arguing that generative AI based on models built with copyrighted works is completely legal is using motivated reasoning.

It is completely legal in the EU, Japan, South Korea and Singapore. [2]

[1] https://libhelp.ncl.ac.uk/faq/43267

[2] https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...

Your link re: Fair Dealing guidelines does NOT make it 100% legal. For one, the ENTIRE works are encoded into the model--not a part of them. For another, those are just guidelines, not explicit exceptions, just like Fair Use in the US. It's all very hand-wavy, even more so in the UK, apparently, so there's no way you can list those guidelines and say that anything is clearly allowed.

Your second link means it's legal for them to CREATE THE MODEL. This is true in the US as well: The model is a clearly transformative use of the data.

But as soon as the model produces works in the same use category as the original work (code -> model -> code, for instance, or image -> model -> image), it is no longer transformative.

If you understand the law and the technology, it's clearly generating derivative works.

Entire works are encoded in the model in the same way that if I cut up a document into individual words and put them in a bag with a bunch of other documents, then if I were a no-life loser I could spend a long time "recreating" the document from individual words. The bag of cutout words is NOT a copyright violation though.

I'm wary of how hard the law is likely to stick to e.g. "verbatim," since that implies there is a meaningfully "creative" step the computer is doing for purposes of escaping "infringement."

Let's say I take a copyrighted picture, make it into a jigsaw puzzle and leave out a few pieces; I can't reproduce the original, but that's still certainly infringement.

If the correct assembly of the jigsaw puzzle was not the original picture, but a rearrangement of the original picture with arguable satire/social commentary (such as people's heads being where their crotches should be) then that becomes fair use.

No dispute there; and this will get us to what will be the ultimate question: An AI thing, or a human, could both end up with a result that looks like what you're describing. The question will be -- should these two things be seen as different, legally?

I'm fairly certain they should be seen as different, on their face, from a public policy point of view, i.e. I'm presently very comfortable ducking the question of "can they think," and for the present, assume they do not -- otherwise you're essentially saying that non human AI tools are "humans" for the purpose of copyright infringement.

> On the flip side, it wouldn't be hard to put guardrails on chatgpt output so that if too large a percentage of an answer is verbatim, it's blocked.

It wouldn't be hard conceptually, but it would be a copyright violation unless OpenAI could establish a novel kind of fair use, distinct from the AI training fair use they rely on for ChatGPT not to be a copyright violation no matter what output it produces. What it would involve is building a database that is a mechanical copy of all the copyright-protected works in ChatGPT's training set, integrating it as part of the commercial ChatGPT product, and consulting it with some kind of full-text search on each generation from ChatGPT to verify that no passage of sufficient length was reproduced verbatim.

Not necessarily. YouTube has fingerprints of copyrighted works for this exact purpose, and it works fine.

YouTube Content ID is based on a specific agreement with the individual content owner that permits the specific use. Which works for YouTube because it's for UGC, not content YouTube generates.
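
For anyone wondering what the guardrail being argued about might look like mechanically, here is a minimal sketch, not anything OpenAI or YouTube is known to run. It assumes a fingerprint index made of hashed word windows ("shingles") rather than stored text, and it blocks an answer when too large a fraction of its windows match. The names `build_index`, `verbatim_fraction`, `should_block`, the 8-word window, and the 0.5 threshold are all hypothetical choices for illustration. Whether keeping hashes instead of the text changes the legal picture is exactly the question debated above, and the code doesn't answer it.

```python
import hashlib
import re

SHINGLE_WORDS = 8  # hypothetical window size, in words


def _tokens(text):
    # Lowercase and split into simple word tokens so that case and
    # punctuation differences do not hide a verbatim match.
    return re.findall(r"[a-z0-9']+", text.lower())


def _shingle_hashes(text, k=SHINGLE_WORDS):
    # Hash every run of k consecutive words. Texts shorter than k words
    # produce no shingles at all.
    words = _tokens(text)
    return [
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    ]


def build_index(protected_texts, k=SHINGLE_WORDS):
    # The index holds only hashes of word windows, not the works themselves.
    index = set()
    for text in protected_texts:
        index.update(_shingle_hashes(text, k))
    return index


def verbatim_fraction(answer, index, k=SHINGLE_WORDS):
    # Fraction of the answer's word windows that also appear in the index.
    hashes = _shingle_hashes(answer, k)
    if not hashes:
        return 0.0
    return sum(1 for h in hashes if h in index) / len(hashes)


def should_block(answer, index, max_fraction=0.5, k=SHINGLE_WORDS):
    # Block when "too large a percentage" of the answer is verbatim.
    # The 0.5 default is an arbitrary placeholder, not a legal standard.
    return verbatim_fraction(answer, index, k) > max_fraction
```

Usage would be something like `index = build_index(works)` once up front, then `should_block(answer, index)` on each generation. Exact-window hashing only catches verbatim runs, so lightly paraphrased output would pass straight through.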
What we need is traceability from training data to final output in AIs, with those works cited as sources or stored in the metadata of the produced output. That way there is no question as to what works were consulted, and people can check to see if their copyrighted work was used without permission by the LLM/diffusion model.

I understand this is a hard problem, but lots of tech needs to solve hard problems, and if AI were anything but a plot by the billionaire class to obsolete the need to pay professionals to do work, it would be required. As the people in power benefit from it, probably nothing will be done on this front.