by albert180 888 days ago
I think the biggest issue is with publishing the datasets. Then people and companies would discover that it's full of their copyrighted content and sue. I wouldn't be surprised if they slurped the whole of Z-Library et al. into their models. Or, in Google's case, its entire Google Books dataset.

Somewhat unrelated, but here is a thought experiment...

If a human knows a song "by heart" (imperfectly), it is not considered copyright infringement.

If an LLM knows a song as part of its training data, then it is copyright infringement.

But what if you developed a model with no prepared training data and forced it to learn from its own sensory inputs? Instead of shoveling it bits, you played it this particular song and it (imperfectly) recorded the song with its sensory input device, the same way humans listen to and experience music.

Is the latter learning model infringing on the copyright of the song?

If a person plays a song similarly enough, then it is copyright infringement! Mere knowledge is irrelevant; it is the producing of copies (and also a few related actions) which is prohibited by copyright.
No language model plays a song in the narrow sense either; they just send a representation of the song to some other program (or human) that might play it.

Mere knowledge is irrelevant only because we don't (yet) have a mechanism to pry open someone's brain and inspect the copying of songs between its different parts. Otherwise, mechanistically, apart from one using silicon and the other using wetware, they're pretty much doing the same thing.

> send a representation of the song

That is copying. If not the song itself, at the least a close derivative work.

That's my point. If you could pry open a human brain and decipher how it works, you would see some representation of the song being sent around to various parts of the brain.
This depends: how many times does it need to hear the song to build up a reasonably consistent internal reproduction, and are you paying per stream, buying the input data as CD singles, or just putting the AI in a room with the radio on and waiting for it to take in the playlist a few times?
Let's assume it is in a room with a radio listening to music, and that the AI is "general purpose", meaning that it can also perform other functions. It is not the sole purpose of the AI to do this all day.

I see where you are coming from in trying to identify the source of the copyright. This would be important information if a human wanted to sue another human for reproducing copyrighted material.

However, does that apply here? Nobody hears a human humming a song and asks if they obtained that music legally. Should it be important to ask an AI that same question if the purpose of listening to the song is not to steal it?

The standards applied are exactly the same regardless of what tools are used. It doesn't matter if you're talking about a dumb AI, a general purpose AI, or a Xerox machine.

If you want an exception to copyright, you're going to want to start looking at a section 107 (of the copyright act) exception: https://www.copyright.gov/title17/92chap1.html#107

The reason someone walking down the street and humming a song is not a violation is that it very clearly meets all of the tests in section 107.

The biggest problem with feeding stuff through a black box like an LLM is that it isn't easy for a human to determine how close the result is to the original. An LLM could act like a Xerox machine, and it won't tell you.

I think this conversation has corrected some misgivings I had about the AI copyright argument. My takeaway is:

Possession of copyrighted material is not inherently infringing on a copyright. Disseminating copyrighted material is, unless you meet section 107. AI runs afoul of section 107 when it verbatim shares copyright material from its dataset without attribution.

> AI runs afoul of section 107 when it verbatim shares copyright material from its dataset without attribution.

Technically, the AI doesn't run afoul. The person disseminating the copyrighted material does.

Not humming, but don't we sometimes prevent people from singing songs? The birthday song was famously held up by IP law for some years, right?
> If an LLM knows a song as part of its training data, then it is copyright infringement.

No it isn't. You can feed whatever you want into your LLM, including copyrighted data. The issues arise when you start reproducing or distributing copyrighted content.

> You can feed whatever you want into your LLM, including copyrighted data.

That's currently the subject of considerable legal debate.

https://edition.cnn.com/2023/07/10/tech/sarah-silverman-open...

That is mostly an issue of the latter: whether the service that Meta/OpenAI offers outputs content that is a violation of copyright. Technically, derivative works are a copyright violation, but if you're not distributing them, you normally have a good fair use argument, and/or nobody knows.