Hacker News
by yokem55 865 days ago
> I think that from a legal standpoint, the actual technical means by which something was accomplished doesn't matter if the process as a whole is effectively copyright infringement.

Which is why when the user of the model prompts for something infringing, and is successful at getting close to verbatim output (because the prompt was too constraining, or because the work is overrepresented in the training data), it is that particular output that is infringing. And maybe that means that services operating that prompt/response software are guilty of contributory infringement if they can't adequately prevent that kind of output.

But that does not mean that training the model was infringing. Nor does that mean distribution of the model is infringing. And if a user of the prompt/response software never prompts for anything infringing, and the software never spontaneously recreates anything infringing, there's no infringement happening.

There are lots of technologies out there that are highly capable of enabling infringement at a massive scale. And where the vast majority of their actual usage is absolutely infringing. But we don't completely shut down those technologies that, on their own, are not infringing. BitTorrent clients are perfectly legal to develop. And distribute. And people use those clients to commit infringement at large scale. But they are still perfectly legal to write and distribute.

1 comments

TensorFlow is also perfectly legal to develop and distribute, and no one contests this.

People object to a specific artifact, the "model weights", which were produced using copyrighted works as input and can be used to reproduce those same copyrighted works. In the BitTorrent analogy, people want to shut down specific pirate trackers and The Pirate Bay website.

From the above EFF article:

> First, a derivative work still has to be “substantially similar” to the original in order to be infringing. If the original is transformed or abridged or adapted to such an extent that this is no longer true, then it’s not a derivative work. A 10-line summary of a 15,000-line epic isn’t a derivative work, and neither are most summaries of books that people make in order to describe those copyrighted works to others.

The statistics generated about the works entered as input, do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work. And it's only after combining those stats with the stats of billions of other works (which is its own creative process to determine the best statistical methodologies to achieve that combination) that anything intelligible can be produced in the output stage.

I think the case for Stable Diffusion in general is not too bad; however, the EFF tempers its optimism when it comes to cases where the model may actually memorize the inputs:

> To sum up: a diffusion model can, in rare circumstances, generate images that resemble elements of the training data. De-duplication can substantially reduce the risk of this occurring. But the strongest copyright suit against a diffusion-based AI art generator would likely be one brought by the holder of the copyright in an image that subsequently was actually reproduced this way.

The EFF's position seems to be (and I personally agree with it, FWIW) that Stable Diffusion almost certainly does not run afoul of at least the vast majority of the copyright holders whose data it was trained on.

> The statistics generated about the works entered as input, do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work.

Of course, this needs a lot of qualification. Compression and intelligence are generally considered to be related, and indeed, compression also relies on statistical analysis (like entropy coding a la Huffman, or frequency analysis via Fourier transforms). Granted, lossless compression algorithms are designed to reproduce their input verbatim--it's the entire point. But I think ML weights may exist somewhere "in the middle" so to speak; depending on the model architecture and how it's trained, it may be more or less literally like compression. Vastly overfit models are very much like compression, whereas large generalized models like Stable Diffusion are pretty far away and yes, generally can't reproduce inputs verbatim. (However: I suspect many LoRA models are quite overfit and may not be in the same boat.)
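
To make the "compression is statistics" point a bit more concrete, here is a rough Python sketch. It compares the empirical byte-frequency entropy of two inputs with how well zlib (DEFLATE, which combines LZ77 with Huffman coding) compresses them. The function names and both sample inputs are my own assumptions, picked purely for illustration; nothing here says anything about how any particular ML model stores information.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy_bits_per_symbol(data: bytes) -> float:
    """Empirical zeroth-order entropy of the byte frequencies, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at maximum effort."""
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample inputs, chosen only for illustration.
repetitive = b"the quick brown fox " * 200
rng = random.Random(0)  # fixed seed so the "noisy" sample is reproducible
noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

for name, sample in [("repetitive", repetitive), ("noisy", noisy)]:
    print(
        name,
        round(shannon_entropy_bits_per_symbol(sample), 2),
        round(compression_ratio(sample), 3),
    )
```

The repetitive input shrinks to a small fraction of its size while the pseudo-random one barely shrinks at all, and in both cases decompression gives back the input exactly. Note that the byte-frequency entropy only captures symbol statistics; zlib also exploits repeated substrings, which is why the repetitive sample compresses far better than its per-byte entropy alone would suggest.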

However, that's just for image generation. I feel like LLMs and text generation are an entirely different ballgame, and given that we can't actually inspect the model weights in the case of GPT-4, the best we can really do to surmise what's going on is to see how badly it seems to overfit its training data.

I am unconvinced that this matter is settled as a whole, although I do think the EFF article presents a good overview of the case regarding Stable Diffusion and it does coincide pretty closely with what I actually believe. But this article is about large language models, which may legitimately be a completely different ball game.

One thing that I think people forget about is that the prompt used when "reproduc[ing] those same copyrighted works" is also a part of why it spits out similar things. It's not just the model doing it. A traditional artist can be prompted to recreate a copyrighted work in much the same way with the right prompts.

I don't think most people are misinterpreting things. The truth is that models which are not terribly overfit rarely output their inputs verbatim. For Stable Diffusion, the odds are apparently close to infinitesimal, and this is good because it implies that the weights are not literally encoding some crazy kind of compressed copies of the images in question.

On the other hand, if you prompt a code-generating model with some comment and a function declaration that it knows exists and it spits out 100+ lines of nearly verbatim code, that's a different story entirely. If I prompt a human with that sort of thing, they will almost certainly write different code even if they've seen the original source code in question. This is in part because the way humans write code is different from the way LLMs write code; humans tend to iterate somewhat non-linearly, and I think if you ask the same person to write the same thing on different days, they would probably come up with different results. It would be quite rare for a human to just see a familiar segment of code and then begin dumping near-verbatim copies of existing codebases.
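
For what it's worth, "nearly verbatim" is something you can put a rough number on from the outside, without seeing any weights. Below is a minimal Python sketch, assuming you have both a reference snippet and a model's output as plain strings. It finds the longest contiguous run of whitespace-separated tokens that both share, using `difflib.SequenceMatcher` from the standard library. The function names, the tokenization, and the sample snippets are all hypothetical choices of mine, and this is not a legal test of substantial similarity.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(reference: str, generated: str) -> int:
    """Length, in whitespace-separated tokens, of the longest contiguous shared run."""
    ref_tokens = reference.split()
    gen_tokens = generated.split()
    # autojunk=False keeps difflib from discarding very frequent tokens.
    matcher = SequenceMatcher(None, ref_tokens, gen_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_tokens), 0, len(gen_tokens))
    return match.size


def verbatim_fraction(reference: str, generated: str) -> float:
    """Share of the generated tokens covered by that single longest shared run."""
    gen_len = len(generated.split())
    if gen_len == 0:
        return 0.0
    return longest_verbatim_run(reference, generated) / gen_len


# Hypothetical snippets, for illustration only.
reference = "def add(a, b):\n    return a + b"
copied = "def add(a, b):\n    return a + b"
rewritten = "def total(x, y):\n    result = x + y\n    return result"

print(longest_verbatim_run(reference, copied), verbatim_fraction(reference, copied))
print(longest_verbatim_run(reference, rewritten), verbatim_fraction(reference, rewritten))
```

An exact copy scores a fraction of 1.0, while an independent rewrite of the same idea shares only isolated tokens. Real memorization studies use more careful tokenization and much larger corpora; this only shows the shape of the measurement.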

AI models that readily and easily bias themselves toward outputting their inputs do exist. It is not clear how many models actually do this, but this is definitely a huge part of the concern when people talk about copyright and model weights.

It's a bit clouded by people who are just generally hoping that today's AI model weights are illegal for social reasons, but that's not the position I am trying to present. (I'm not really sure what we should do regarding societal impact.)