by philwelch 530 days ago
No, you’re missing the point of copyright. The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works. If an LLM produces original works that are influenced by the training data, that is not a violation of copyright. If it reproduces the training data verbatim, it is.

    > The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works.
As I understand it, the definition of "the right to produce original works influenced by previous works" has been a slowly moving target in my lifetime. Think about the effects of the album Paul's Boutique by the Beastie Boys. They went wild with sampling and paid very little (zero?) to license those samples. Then there were a bunch of court cases in the US that decided that future samplers needed to license the samples from the original authors. However, the ability to create legal, derivative works is usually carefully defined in copyright law. Can you comment on this matter vis-à-vis LLMs?

    > If an LLM produces original works that are influenced by the training data, that is not a violation of copyright.
I'm pretty sure if an LLM creates Paul's Boutique 2.0 in 2025 using an incredible number of samples, then someone cannot sell it (or use it in a YouTube video) without first licensing those samples. I doubt very much someone could just "hide behind" an LLM and claim, "Oh, it is an original, but derivative, work, created by an LLM." I doubt courts would allow that.
> I'm pretty sure if an LLM creates Paul's Boutique 2.0 in 2025 using an incredible number of samples, then someone cannot sell it (or use it in a YouTube video) without first licensing those samples. I doubt very much someone could just "hide behind" an LLM and claim, "Oh, it is an original, but derivative, work, created by an LLM." I doubt courts would allow that.

This isn’t how LLMs work though. Samples are just that, literal samples that are copied from one work to another verbatim. LLMs use training data to construct a predictive model of which tokens follow each other. You probably could get an LLM to use samples deliberately if you wanted to, but this isn’t how they typically work.
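
To make "a predictive model of which tokens follow each other" concrete, here is a toy sketch in Python. It is a simple bigram counter, which is far cruder than a real LLM (real models are neural networks over subword tokens), and the names `train_bigram` and `predict_next` are made up for illustration. The point is only that what gets stored is follow-on statistics rather than a pasted-in sample.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

# Toy training text (an assumption for illustration, not real training data).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

With a corpus this small the model mostly echoes its input, which is roughly the memorization worry in miniature; with a large and varied corpus the counts blend many sources together.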

Regardless, at that point you’re just evaluating the claim of copyright infringement based on the nature of the work itself, which is exactly what I’m advocating, versus presuming that all LLM output is necessarily copyright infringement if any copyrighted material was used in training.

i weirdly agree with you, but also want to point out that “influenced by the training data” is doing some very heavy lifting there.

exactly how the new work is created is important when it comes to derivative works.

does it use a copy of the original work to create it, or a vague idea/memory of the original work’s composition?

when i make music it’s usually vague memories. i’d argue that LLMs have an encoded representation of the original work in their weights (along with all the other stuff).

but that’s the legal grey area bit. is the “mush” of model weights an encoded representation of works, or vague memories?

I don’t really think it matters because you can just compare the output to the input and apply the same standard, treating the process between the two as a black box.
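
As a rough sketch of what "compare the output to the input" could look like mechanically, here is one way to measure verbatim overlap in Python. To be clear, this is not a legal test of infringement, just an illustration of black-box comparison; the function names and the choice of word n-grams are my own assumptions.

```python
def ngrams(text, n):
    """Set of word n-grams in `text`, lowercased and split on whitespace."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(source, output, n=5):
    """Fraction of the output's word n-grams that also appear in the source."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(source, n)) / len(out)

source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
reworded = "a fast brown fox leaps over a sleepy dog"
print(verbatim_overlap(source, copied, n=3))
print(verbatim_overlap(source, reworded, n=3))
```

Nothing here looks at how the output was produced, so the same check applies whether the author is a person or a model.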
did you just call me a black box? :/

not sure how i feel about being reduced down to that as a human being.

As far as I’m concerned you are a black box. Just as I’m a black box from your perspective. In principle I could come over and vivisect your brain if you’d like, but I doubt you’d be interested, and I wouldn’t really want to incur the legal liability even if you were.

Besides, “black box” just means that your internal mental life and cognitive mechanism are opaque to me. It’s not like I’m calling you a p-zombie.

Also, even if an LLM generates an original work, the weights it used may still be a derived work.