by chacham15 117 days ago
> The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.

The defense to training on copyrighted material is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring. Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyrighted material that the human learned from.

The defense to actually reproducing a work is that in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. using a screen recorder to "steal" a movie).

None of this is to say that these defenses are correct/moral, but rather that this article doesn't add anything to the question of whether they are or aren't.

7 comments

> Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyrighted material that the human learned from.

Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.

There's a reason that fan fiction routinely runs afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.

I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?

This isn't the defense you think it is. Performing a copyrighted work from memory - e.g. a piece of music, a poem, a story, etc - is still a copyright violation. There's no special protection for works that a human has memorized.
The key word in the HN headline is _can_.

Humans are not judged on the basis of what they _can_ do.

Reasoning about how to constrain tools on the basis of what they _could_ do, if e.g. used outside their established guardrails, needs to be very nuanced.

Correct; the ability of a model to reproduce source material verbatim does not necessarily make the model's existence illegal. However, using a model to do just that might very well present a legal liability for the user. I would be interested to see the extent to which models can "recite from memory" source code, e.g., from the various MS code leaks. Put another way, if I'm using LLM code generation extensively, do I need to run a filter on its output to ensure that I don't "accidentally" copy large chunks of the Windows codebase?
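
For what it's worth, the kind of output filter asked about here could be sketched as a token n-gram overlap check. This is only a rough illustration, not a real clearance tool: the function names `token_ngrams` and `overlap_ratio`, the regex tokenizer, and the window size `n=8` are all assumptions made for the example, and it presumes you have a reference text you are legitimately allowed to compare against.

```python
import re


def token_ngrams(text, n):
    """Return the set of n-token windows in `text`.

    Tokens are runs of word characters or single punctuation marks,
    so whitespace and indentation differences are ignored.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated, reference, n=8):
    """Fraction of n-grams in `generated` that also occur in `reference`.

    Returns 0.0 when `generated` has fewer than n tokens. The window
    size n=8 is an arbitrary illustrative choice, not a legal threshold.
    """
    gen = token_ngrams(generated, n)
    if not gen:
        return 0.0
    ref = token_ngrams(reference, n)
    return len(gen & ref) / len(gen)
```

A high ratio would only flag output for human review; it says nothing by itself about whether the overlap is infringing.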
>There's no special protection for works that a human has memorized.

Who's liable for the copyright infringement if you can coax it out of a system? If you can bypass paywalls by using Google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is Google on the hook or the person doing it?

Both. If I sell obviously pirated CDs on the street corner, it's not only illegal for me to copy them and sell them, it's also illegal for my customers to buy them.
>it's also illegal for my customers to buy them.

Is it? There are plenty of people prosecuted for running illegal streaming sites and torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.

Just because someone doesn't get arrested does not mean something is legal.
Yes. Both Google and the human in question.
1. How does this interact with the rulings that both Google Books (i.e., large-scale scanning of books without authors' consent) and Google snippets (the same, but for websites) are legal?

2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?

For #2, yes: you would be engaging in copyright infringement. The library, being on the hook, would probably ask you to stop if they noticed you copying full books. If not the first time, certainly on the second.
>If you can bypass paywalls by using google's cache feature

That is quite different. Google serves (or used to serve) its users whatever the website presents to its crawler; it does not try to avoid paywalls or interact with the website in any capacity other than requesting information.

The whole “humans also do this” argument isn’t a winning defence here. Humans and copyright have a long history, with so much law that it is easy to get confused.

The default assumption here seems to be that the system needs to be broken. This is similar to the Google defence. If a user's intent is to search for cracked software, what can poor Google do about it? The answer is to make it even more difficult.

This is a defence also used by torrent sites using magnet URLs. “We don’t host files” is the default defence. But then if these sites get hit with a DMCA notice, they are required to remove the magnet URL.

So the article shows what the lawyer is saying. Despite claims that it is difficult to search for full books, it really isn’t so. It is trivial. When it goes to court, and it will, AI models will be required to make it even more difficult and to allow for DMCA-like takedowns.

> Humans can also reproduce copyrighted works from memory as well

That's simply not true. No humans can memorize entire novels, as this research proved these models do. And definitely not all of these novels, and code bases, and who knows what else all at the same time.

They absolutely can. Millions of people can recite the Quran verbatim, word for word. That's 77797 words. There is even a title for those people.

https://en.wikipedia.org/wiki/Hafiz_(Quran)

It's not far-fetched to think that people could recite books just like an LLM. I don't know why they'd want to, but that's neither here nor there.

>No humans can memorize entire novels, as this research proved these models do.

Humans can, however, remember entire songs, and songs are definitely long enough to be considered copyright protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't be like "well humans are committing copyright infringement but since it's limited to a few hundred words we'll give it a pass".

It's not that you can remember a song and therefore commit copyright infringement when you sing.

For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

For the last 0.001%, we would call it a cover, and typically the individual doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.

>For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also courts have ruled that lyrics themselves are protected by copyright.

https://en.wikipedia.org/wiki/Lyrics#Copyright_and_royalties

>Artists are typically cool with covers.

Artists being "cool" with something doesn't mean they're not violating copyright law.

Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song, that's a silly statement to make. Copyright belongs to some entity, obviously.

You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.

On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter? The only valid argument I would see here is if the copyright holder doesn't know about the infringement and therefore cannot seek remedies, but we can fish for illegal scenarios all day if we would like: that's not useful though.

>Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song, that's a silly statement to make. Copyright belongs to some entity, obviously.

>You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.

I'm not sure why you're so confidently dismissive here. I wasn't trying to claim that nobody owned the lyrics. I brought that point up because even in the case of an amateur singing a song, even if you accept the "for 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form" excuse, you're still infringing on the copyright of the lyrics, because it's a derivative work. Moreover, it's unclear whether that excuse even works. If you make a low-cost version of Star Wars, copying the screenplay exactly, that still seems like copyright infringement, even if "it's not a replacement for the original in any way shape or form".

>On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter?

Virtually nobody got sued for torrenting with a VPN on. Does that mean it's fair to round that off as being legal, because "if the copyright holder doesn't pursue you then does it matter"?

If I sing a copyrighted song, however absurd it may sound, I CAN, in fact, be sued by the copyright holder.
I also was skeptical, but musical works make more sense for that argument. Their premise is still flawed, though.
You can't pay a human to reproduce copyrighted material either.
But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.

And of course AIs are excellent at taking direction, so:

If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.

If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that should be fine. I doubt the legal world agrees with me though.

> But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.

I wouldn't be so sure, at least under US law. 17 USC 101 defines a "copy" as:

  [...] material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

If I memorize a work, what ends up in my brain is not a copy according to that definition, because with current technology there is no machine or device which can be used to perceive, reproduce, or otherwise communicate it. The work can only be perceived, reproduced, or otherwise communicated by using my brain, which is not a machine or device.

No copy in my brain means that memorizing the work cannot infringe the copyright owner's exclusive right to reproduce the work in copies.

An LLM, unlike my brain, is a machine or device which can be used to perceive, reproduce, or otherwise communicate the work and so the work stored in the LLM is a copy.

Training an LLM then, unlike a brain memorizing a work, makes a copy and so would be covered by the copyright owner's exclusive right to make copies.

That's going to need to be justified, probably by arguing fair use.

I'd argue your brain is that "machine or device" -- the fact that the storage and the playback mechanism are one and the same is irrelevant. The fact that you have to be willing/induced to replay the content back just makes you a worse machine :-)
Interesting argument but not likely to go far. As far as I can tell US copyright law has never been taken to include brains as machines or devices.

This is actually relevant in some real cases, namely improvised works. Attempts to claim copyright on improvised works that were not recorded have generally failed. If brains counted as machines or devices, then the work inside the performer's head would be a recording and the work would have copyright.

That is one of the reasons it is usually recommended that musicians should record their live performances. That gets them copyright on anything they improvise during the show. Also it gets them copyright on that particular performance of their music, which helps them go after anyone who makes an unauthorized recording of the show. (Copyright is only automatic upon recording when the recording is by or under the authority of the creator).

Asking for copyrighted material isn't a crime. Producing copyrighted material is.

By the way, give me a digital copy of 28 Years Later. Please.

>The defense to training with copyright is that it is the same as how humans learn from copyrighted material.

Yeah, it's something people say but it is severely lacking in evidence and credibility.

What calculus?