by yokem55 864 days ago
The (in my view) problem with the author's argument is that the first step he claims is happening is not. Publicly available content gets read, as is the point of publicly publishing it. Then the user uses a computer program to make some statistics about that bit of content. Those statistics about that specific work, on their own, cannot reproduce or recreate the specific work. Then those statistics are put into a database and combined with the stats about billions of other works. Then another program is written to query the database to make probabilistic guesses in response to prompts from a user. It's this last stage that could potentially recreate a work in an infringing manner. But everything that led up to that point (creating the model) is simply not something that current law considers to be infringing on copyright in any meaningful way. It doesn't even require a "fair use" assessment, because creating statistics about a work that cannot on their own reproduce the work does not create a copy, nor does it make a public performance of the work.

Is this all terribly unfair to the people who published their work assuming this couldn't happen? Yes. But the response needs to be "let's come up with and pass better law" and not "let's twist and contort the current law to be something it's not."

Forgive me for this one, but it comes from genuine curiosity and not snark. You are making assertions about how copyright law works, but you don't qualify them with either IANAL or any legal credentials. So I must ask: do you have a basis for these claims?

I love participating in armchair analysis of the law, since in software we pretty much have no choice but to do so anyway, but my understanding has always been that we still don't actually have strong case law for machine learning and AI. It does seem like the existing cases regarding weights and ML training have leaned strongly towards the weights in general not being considered a derivative work, but I have doubts that the law would see this as black and white. For example, even if the general consensus is that ML training to produce weights, in and of itself, does not create a derivative work, if you are able to show that a given set of weights can reproduce inputs verbatim (as a result of overfitting or memorization), I have my suspicions that it would not be shrugged off so easily. In true "color of my bits" fashion, I think that from a legal standpoint, the actual technical means by which something was accomplished doesn't matter if the process as a whole is effectively copyright infringement.

There do seem to be some ongoing cases regarding this, such as Getty Images v. Stability AI, and it will be interesting to see their results.

The EFF's post about it looks at each of the three steps (obtaining the images, training, and generating an output image): https://www.eff.org/deeplinks/2023/04/how-we-think-about-cop...

It's an interesting read and makes a good case for why none of the steps are directly copyright infringement, even if you can prompt the output to be (and in that case the person doing the prompting should be the one at fault, same as someone drawing something infringing directly).

> I think that from a legal standpoint, the actual technical means by which something was accomplished doesn't matter if the process as a whole is effectively copyright infringement.

Which is why, when the user of the model prompts for something infringing and is successful at getting close to verbatim output (because the prompt was too constraining, or because the work is overrepresented in the training data), it is that particular output that is infringing. And maybe that means that services operating that prompt/response software are guilty of contributory infringement if they can't adequately prevent that kind of output.

But that does not mean that training the model was infringing. Nor does that mean distribution of the model is infringing. And if a user of the prompt/response software never prompts for anything infringing, and the software never spontaneously recreates anything infringing, there's no infringement happening.

There are lots of technologies out there that are highly capable of enabling infringement at a massive scale, and where the vast majority of their actual usage is absolutely infringing. But we don't completely shut down those technologies that, on their own, are not infringing. BitTorrent clients are perfectly legal to develop. And distribute. And people use those clients to commit infringement at large scale. But they are still perfectly legal to write and distribute.

TensorFlow is also perfectly legal to develop and distribute, and no one contests this.

People object to a specific artifact, the "model weights", which were produced using copyrighted works as input and can be used to reproduce those same copyrighted works. In the BitTorrent analogy, people want to shut down specific pirate trackers and The Pirate Bay website.

From the above EFF article:

> First, a derivative work still has to be “substantially similar” to the original in order to be infringing. If the original is transformed or abridged or adapted to such an extent that this is no longer true, then it’s not a derivative work. A 10-line summary of a 15,000-line epic isn’t a derivative work, and neither are most summaries of books that people make in order to describe those copyrighted works to others.

The statistics generated about the works entered as input do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work. And it's only after combining those stats with the stats of billions of other works (which is its own creative process to determine the best statistical methodologies to achieve that combination) that anything intelligible can be produced in the output stage.

I think the case for Stable Diffusion in general is not too bad; however, the EFF tempers its optimism when it comes to cases where the model may actually memorize the inputs:

> To sum up: a diffusion model can, in rare circumstances, generate images that resemble elements of the training data. De-duplication can substantially reduce the risk of this occurring. But the strongest copyright suit against a diffusion-based AI art generator would likely be one brought by the holder of the copyright in an image that subsequently was actually reproduced this way.

The EFF's position seems to be (and I personally agree, FWIW) that Stable Diffusion almost certainly does not run afoul of at least the vast majority of copyright holders of the data it was trained on.

> The statistics generated about the works entered as input, do not resemble the original works. Nor can those statistics on their own reproduce the original work. At most they are brief mathematical summaries of the work.

Of course, this needs a lot of qualification. Compression and intelligence are generally considered to be related, and indeed, compression also works on statistical analysis (like entropy coding a la Huffman, or frequency analysis via Fourier transforms). Granted, compression algorithms are designed to reproduce their input verbatim--it's the entire point. But I think ML weights may exist somewhere "in the middle" so to speak; depending on the model architecture and how it's trained, it may be more or less literally like compression. Vastly overfit models are very much like compression, whereas large generalized models like Stable Diffusion are pretty far away and yes, generally can't reproduce inputs verbatim. (However: I suspect many LoRA models are quite overfit and may not be in the same boat.)
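
To make the "somewhere in the middle" point concrete, here is a toy sketch in Python. It is not how Stable Diffusion or any real model works; it is a character-level n-gram counter, and the names `train_ngram`, `generate`, and `TEXT` are made up for illustration. With a short context, the counts are a lossy summary that cannot give the text back. With a long enough context on a single small input, the same counts behave like a lookup table and greedy decoding replays the input.

```python
from collections import Counter, defaultdict

def train_ngram(text: str, order: int) -> dict:
    """Count which character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts: dict, seed: str, length: int) -> str:
    """Greedy decoding: always emit the most frequent next character.

    The context length is taken to be len(seed).
    """
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = counts.get(out[-order:])
        if not followers:
            break  # context never seen during "training"
        out += followers.most_common(1)[0][0]
    return out

# Hypothetical stand-in for a single training work.
TEXT = "the quick brown fox jumps over the lazy dog and then naps"

# Order 1: coarse statistics, a lossy summary of the text.
low = generate(train_ngram(TEXT, 1), TEXT[:1], len(TEXT))

# Order 8: every context in TEXT is unique, so the counts memorize it.
high = generate(train_ngram(TEXT, 8), TEXT[:8], len(TEXT))

print(repr(low))
print(high == TEXT)
```

In this toy setup the order-8 model regenerates `TEXT` exactly from its first eight characters, while the order-1 model gets stuck in a short loop. Real models trained on billions of works sit much closer to the first case, but the sketch shows why "it's only statistics" does not by itself rule out memorization.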

However, that's just for image generation. I feel like LLMs and text generation are an entirely different ballgame, and given that we can't actually inspect the model weights in the case of GPT4, the best we can really do to surmise what's going on is to see how badly it seems to overfit its training data.

I am unconvinced that this matter is settled as a whole, although I do think the EFF article presents a good overview of the case regarding Stable Diffusion and it does coincide pretty closely with what I actually believe. But this article is about large language models, which may legitimately be a completely different ball game.

One thing that I think people forget about is that the prompt used when "reproduc[ing] those same copyrighted works" is also a part of why it spits out similar things. It's not just the model doing it. A traditional artist can be prompted to recreate a copyrighted work in much the same way with the right prompts.

I don't think most people are misinterpreting things. The truth is that models which are not terribly overfit don't output verbatim inputs often. In fact, for Stable Diffusion the odds are apparently nearly infinitesimally small, and this is good because it implies that the weights are not literally encoding some crazy kind of compressed copies of the images in question.

On the other hand, if you prompt a code-generating model with some comment and a function declaration that it knows exists and it spits out 100+ lines of nearly verbatim code, that's a different story entirely. If I prompt a human with that sort of thing, they will almost certainly write different code even if they've seen the original source code in question. This is in part because the way humans write code is different from the way LLMs write code; humans tend to iterate somewhat non-linearly, and I think if you ask the same person to write the same thing on different days, they would probably come up with different results. It would be quite rare for a human to just see a familiar segment of code and then begin dumping near-verbatim copies of existing codebases.

AI models that readily and easily bias themselves toward outputting their inputs do exist. It is not clear how many models actually do this, but this is definitely a huge part of the concern when people talk about copyright and model weights.
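
For what it's worth, "nearly verbatim" is something you can at least roughly measure. Here is a minimal sketch using Python's standard `difflib`; the helper names and the sample strings are hypothetical, and the longest shared run is only one crude proxy, not a legal test for substantial similarity.

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source: str, output: str) -> str:
    """Return the longest substring that appears in both strings."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, source, output, autojunk=False)
    m = matcher.find_longest_match(0, len(source), 0, len(output))
    return source[m.a:m.a + m.size]

def verbatim_fraction(source: str, output: str) -> float:
    """Share of the output covered by its longest run copied from source."""
    if not output:
        return 0.0
    return len(longest_verbatim_run(source, output)) / len(output)

# Hypothetical example: an "original" function and two candidate outputs.
SOURCE = "def add(a, b):\n    return a + b\n"
COPIED = SOURCE
REWRITTEN = "def add(x, y):\n    total = x + y\n    return total\n"

print(verbatim_fraction(SOURCE, COPIED))
print(verbatim_fraction(SOURCE, REWRITTEN))
```

A verbatim dump scores 1.0 here, while an independent rewrite of the same function only shares short fragments like the `return` keyword and indentation. Where you would draw the line between those two is exactly the open question.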

It's a bit clouded by people who are just generally hoping that today's AI model weights are illegal for social reasons, but that's not the position I am trying to present. (I'm not really sure what we should do regarding societal impact.)

There is a nice essay from 2004 that answers that question, "What Color Are Your Bits" (https://ansuz.sooke.bc.ca/entry/23, discussion https://news.ycombinator.com/item?id=24917679)

It talks about copyright infringement in music, but it applies just as well to AI training; just substitute "model weights" for "scrambled file":

> The scrambled file still has the copyright Colour because it came from the copyrighted input file. It doesn't matter that it looks like, or maybe even is bit-for-bit identical with, some other file that you could get from a random number generator. It happens that you didn't get it from a random number generator. You got it from copyrighted material; it is copyrighted. The randomly-generated file, even if bit-for-bit identical, would have a different Colour. The Colour inherits through all scrambling and descrambling operations and you're distributing a copyrighted work,
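
The scrambling the essay has in mind can be sketched in a few lines of Python. This is only an illustration under my own assumptions (an XOR with a random pad, placeholder bytes standing in for the copyrighted file, made-up names); the essay does not prescribe any particular scheme.

```python
import secrets

def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR two equal-length byte strings; applying it twice undoes it."""
    if len(data) != len(pad):
        raise ValueError("data and pad must have the same length")
    return bytes(d ^ p for d, p in zip(data, pad))

# Placeholder standing in for a copyrighted file.
work = b"A stand-in for some copyrighted text."

pad = secrets.token_bytes(len(work))   # random bytes, no Colour of their own
scrambled = xor_bytes(work, pad)       # looks like random noise
restored = xor_bytes(scrambled, pad)   # descrambling gives the work back

print(scrambled.hex())
print(restored == work)
```

The scrambled bytes are indistinguishable from random noise if you only look at the bits, yet together with the pad they yield the original exactly. That is the essay's point: the bits alone don't tell you where they came from, and the law cares where they came from.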

Why should a crime depend on how the tool used for the crime was created? Say two guys write two scripts that each output a copyrighted poem. The one cool guy, call him Sam, can go free because he used algorithm A, but the second guy goes to jail because he used algorithm B, even though the crime is exactly identical.