by theamk 866 days ago
TensorFlow is also perfectly legal to develop and distribute, and no one contests this.

People object to a specific artifact, the "model weights", which were produced using copyrighted works as input and can be used to reproduce those same copyrighted works. In the BitTorrent analogy, people want to shut down specific pirate trackers and the Pirate Bay website.

From the above EFF article:

> First, a derivative work still has to be “substantially similar” to the original in order to be infringing. If the original is transformed or abridged or adapted to such an extent that this is no longer true, then it’s not a derivative work. A 10-line summary of a 15,000-line epic isn’t a derivative work, and neither are most summaries of books that people make in order to describe those copyrighted works to others.

The statistics generated about the works entered as input, do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work. And it's only after combining those stats with the stats of billions of other works (which is its own creative process to determine the best statistical methodologies to achieve that combination) that anything intelligible can be produced in the output stage.

I think the case for Stable Diffusion in general is not too bad, however EFF tempers their optimism when it comes to cases where the model may actually memorize the inputs:

> To sum up: a diffusion model can, in rare circumstances, generate images that resemble elements of the training data. De-duplication can substantially reduce the risk of this occurring. But the strongest copyright suit against a diffusion-based AI art generator would likely be one brought by the holder of the copyright in an image that subsequently was actually reproduced this way.

EFF's position seems to be (and I personally agree with it, FWIW) that Stable Diffusion almost certainly does not infringe the rights of at least the vast majority of copyright holders whose data it was trained on.

> The statistics generated about the works entered as input, do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work.

Of course, this needs a lot of qualification. Compression and intelligence are generally considered to be related, and indeed, compression also works on statistical analysis (like entropy coding a la Huffman, or frequency analysis via Fourier transforms). Granted, compression algorithms are designed to reproduce their input verbatim--it's the entire point. But I think ML weights may exist somewhere "in the middle" so to speak; depending on the model architecture and how it's trained, it may be more or less literally like compression. Vastly overfit models are very much like compression, whereas large generalized models like Stable Diffusion are pretty far away and yes, generally can't reproduce inputs verbatim. (However: I suspect many LoRA models are quite overfit and may not be in the same boat.)
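To make the compression side of that comparison concrete, here is a minimal Python sketch. The sample text is made up, and `symbol_entropy_bits` is a helper name I'm introducing for illustration. It only shows the uncontroversial half of the argument: a frequency table is a statistical summary that can't rebuild the input, while a lossless compressor such as zlib's DEFLATE uses those kinds of statistics plus an encoded stream and restores the input exactly. It says nothing about where any particular model sits between those two ends.

```python
import math
import zlib
from collections import Counter


def symbol_entropy_bits(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Made-up sample input, repeated so the compressor has something to exploit.
text = b"the quick brown fox jumps over the lazy dog " * 50

# The frequency table is a pure statistical summary: on its own it cannot
# rebuild the text, because the ordering of the bytes is gone.
frequencies = Counter(text)
entropy = symbol_entropy_bits(text)

# DEFLATE (LZ77 plus Huffman coding) relies on statistics like these, but it
# also stores the encoded symbol stream, so decompression is exact.
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(f"{len(text)} bytes -> {len(compressed)} bytes")
print(f"byte entropy: {entropy:.2f} bits/byte")
print("verbatim round trip:", restored == text)
```
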

However, that's just for image generation. I feel like LLMs and text generation are an entirely different ballgame, and given that we can't actually inspect the model weights in the case of GPT-4, the best we can really do to surmise what's going on is to see how badly it seems to overfit its training data.

I am unconvinced that this matter is settled as a whole, although I do think the EFF article presents a good overview of the case regarding Stable Diffusion and it does coincide pretty closely with what I actually believe. But this article is about large language models, which may legitimately be a completely different ball game.

One thing that I think people forget about is that the prompt used when "reproduc[ing] those same copyrighted works" is also a part of why it spits out similar things. It's not just the model doing it. A traditional artist can be prompted to recreate a copyrighted work in much the same way with the right prompts.

I don't think most people are misinterpreting things. The truth is that models which are not terribly overfit rarely output their inputs verbatim. For Stable Diffusion the odds are apparently nearly infinitesimal, and this is good because it implies that the weights are not literally encoding some crazy kind of compressed copies of the images in question.

On the other hand, if you prompt a code-generating model with some comment and a function declaration that it knows exists and it spits out 100+ lines of nearly verbatim code, that's a completely different story. If I prompt a human with that sort of thing, they will almost certainly write different code even if they've seen the original source code in question. This is in part because the way humans write code is different from the way LLMs write code; humans tend to iterate somewhat non-linearly, and I think if you ask the same person to write the same thing on different days, they would probably come up with different results. It would be quite rare for a human to just see a familiar segment of code and then begin dumping near-verbatim copies of existing codebases.
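For what it's worth, "nearly verbatim" can be checked roughly. The sketch below is one possible approach in Python, not an established metric: it finds the longest run of whitespace-separated tokens that a generated snippet shares with a reference source, using the standard library's `difflib`. The function names and the sample strings are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, generated: str) -> list:
    """Return the longest run of whitespace-separated tokens found in both texts."""
    ref_tokens = reference.split()
    gen_tokens = generated.split()
    # autojunk=False disables difflib's heuristic that ignores very common tokens.
    matcher = SequenceMatcher(None, ref_tokens, gen_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_tokens), 0, len(gen_tokens))
    return ref_tokens[match.a:match.a + match.size]


def verbatim_fraction(reference: str, generated: str) -> float:
    """Fraction of the generated tokens covered by the longest shared run."""
    gen_len = len(generated.split())
    if gen_len == 0:
        return 0.0
    return len(longest_shared_run(reference, generated)) / gen_len


# Hypothetical strings standing in for an original source and a model's output.
original = "for item in items : total += item . price * item . qty"
output = "for item in items : total += item . price * item . qty"

print(longest_shared_run(original, output))
print(verbatim_fraction(original, output))
```

A long shared run covering most of the output is the "dumping near-verbatim copies" case; short scattered matches are what you'd expect from two independent implementations of the same idea.
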

AI models that readily and easily bias themselves toward outputting their inputs do exist. It is not clear how many models actually do this, but this is definitely a huge part of the concern when people talk about copyright and model weights.

It's a bit clouded by people who are just generally hoping that today's AI model weights are illegal for social reasons, but that's not the position I am trying to present. (I'm not really sure what we should do regarding societal impact.)