Thought question, not entirely related, but if you want to go that route it actually is.
Say I generate some media in Photoshop. I then send you a JPEG representation of said media. You then distribute a PNG copy of the image without license. Have you violated copyright law?
At what point are there enough parameters in an LLM that it is effectively just a compressed version of its training data?
How about deduplicated storage? Is an image that is stored on it, and then reproduced from an index of some sort for distribution, a violation?
If I put data into a thing (let's call that training, and let's call the thing a model), then request the data back out of it and get what is perceived as an exact replica of said data, did I create a copy?
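To make the deduplicated storage case concrete, here is a rough Python sketch. Everything in it (the `DedupStore` class, the 4-byte chunk size, the file names) is made up for illustration and is not how any particular product works. The point is only that no file is stored whole, yet each one comes back byte for byte from the index.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: hypothetical, for illustration only."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes, each unique chunk stored once
        self.index = {}   # file name -> ordered list of chunk digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            digests.append(digest)
        self.index[name] = digests

    def get(self, name):
        # Rebuild the file by following its index into the shared chunk pool.
        return b"".join(self.chunks[d] for d in self.index[name])


store = DedupStore()
store.put("a.img", b"ABCDABCDWXYZ")
store.put("b.img", b"ABCDWXYZ")

print(len(store.chunks))                      # number of unique chunks kept
print(store.get("a.img") == b"ABCDABCDWXYZ")  # exact reconstruction check
```

Both files share the same two chunks, so neither exists on disk as a contiguous copy, but asking for either one by name still hands back the original bytes.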
Does it matter if we call the thing a hard drive instead?
The human brain has on the order of 100 billion neurons. Is it effectively just compressing everything that it has ever seen? I don't actually know, but my feeling is no.
If I ask you to draw Mickey Mouse, you can probably produce a very good representation of him. If I asked you to write the script of The Matrix, assuming you've seen it, I suspect you'd get all the plot points down and major quotes even if it has been years since you've seen it. Are you creating a copy? Absolutely! Don't distribute either of those things without a license. But does the fact that you are capable of making a copy of a thing when asked mean that you violated copyright way back when you watched The Matrix? Is there a copy of The Matrix or Mickey Mouse in your brain?
I will take the strong position that neither our brains nor LLMs contain copies of data in a way that is a violation of copyright. But both are equally capable of generating copyright-violating materials.
However, one can memorize something like a book, particularly if one uses known "memory palace" techniques. Some individuals are particularly good at this.
Of course it matters. If I put the right data into a paintbrush and canvas I'll reproduce copyrighted works too. Nobody is confusing the model for a Picasso any more than they confuse a paintbrush for a painting. The law may rule differently for one reason or another, but these are obviously different categories of things.
The reason a JPEG is copyright infringement has more to do with its express purpose being to allow the user to view that copyrighted work. If it were bundled in a program that just allowed you to view the color histogram of famous works (and the author had the right to view those photos and didn't think it important to save bandwidth by precomputing those histograms) it likely wouldn't be infringement. If it were found out that people were downloading that program just to rip the bundled images out, then the author might get in hot water anyway. Your distributed file example is probably infringing for the same reason.
The model has other capabilities, and I think it would be hard to argue that its purpose is copyright infringement (which is separate from what you seem to be doing, which is arguing that the model itself is infringement -- both a little easier and harder to argue because it pushes more on philosophical distinctions than statements of fact about how people are using a thing).
Separately, there are new classes of concerns these models introduce. We don't have to abuse copyright law to take the time to consider those effects. E.g., should voice cloning be allowed and to what degree? It's already illegal in a lot of contexts (fraud, ...), but we don't currently have many rights when it comes to our innate physical characteristics. To the extent those rights exist, you often have to waive them for basic services (e.g., a nontrivial fraction of leases and jobs stipulate that you give a permanent, <much other legal jargon>, license for them to use your image for nearly any purpose, including falsely characterizing your approval of the property in advertisements and marketing materials -- unless covered under libel/slander and a couple other carveouts they're probably not punishable). Can studios just refuse to hire voice actors for more than one session? Is that good for society? Can I clone passers-by on the street to play in my commercial? These are new enough capabilities (at least at their current scale) that they're not very well legislated, and I wouldn't be surprised if we saw an expansion of something like "moral rights" to cover them.
> I then send you a JPEG representation of said media. You then distribute a PNG copy of the image without license.
The purpose and function of an image format is to represent a single piece. The representation can vary in accuracy and can be changed to another representation (JPEG to PNG, or one JPEG implementation to another JPEG implementation), but the underlying piece is supposed to be the same in intent and a majority of the time is the same in practice. Open the JPEG image using any program that implements the JPEG specification and you will get the same image as with a different such program. The same would apply to the use of encryption on the image, but not to a cryptographic hash (designed to be one-way). Decrypt the encrypted image and you'll get something that's the same in intent and in technicality as the original image. If you don't use a tool which has the purpose of decrypting things then you almost certainly won't get the same image back. That last point applies only partially to an AI: even if you don't try to use the AI to reproduce an existing work, you still might get a reproduction of one; the probability of such a result varies greatly depending on the prompt.
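The reversible versus one-way distinction can be shown with a few lines of Python. This is only a sketch: `zlib` compression stands in for any reversible re-encoding (it is not encryption, and it is not an image codec), and the byte string is a made-up placeholder for image data.

```python
import hashlib
import zlib

# Placeholder bytes standing in for an image; not real image data.
original = b"pixel data standing in for an image" * 10

# Reversible re-encoding: a different representation of the same bytes.
encoded = zlib.compress(original)
decoded = zlib.decompress(encoded)

# One-way digest: a fixed-size fingerprint that is not designed to be inverted.
digest = hashlib.sha256(original).digest()

print(encoded != original)  # the stored representation differs from the original
print(decoded == original)  # but the original is fully recoverable from it
print(len(digest))          # SHA-256 always yields 32 bytes, whatever the input size
```

The compressed form looks nothing like the input, yet the right tool recovers it exactly; the digest is a fixed 32 bytes with no corresponding decode step.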
The purpose of an LLM isn't to reproduce one or more works - or rather, sections of expression - in the dataset. The purpose of an LLM is to produce speech similar to a human's response. The purpose of an image generator model is to produce images that have the characteristics specified in the prompt. In order to produce a copy of something in the training set, the prompt usually needs to reference a specific work, a related person (e.g. an author), a related work, or an attribute that is strongly associated with a particular work/author. Regarding the latter, there was a Hacker News post (that I can't find because I forgot the post title) from a month or two back about an AI image generator that produced images of the robot C-3PO from Star Wars even though the prompt was about "space" and "robot" with no reference to Star Wars. My interpretation is that the AI model had a strong association between space robot and Star Wars because C-3PO is (I speculate) one of the most common space-related robots that people talk about online. Or perhaps, the Star Wars works in the training set made up a majority of the works associated with both "space" and "robot". But I digress.
The likelihood that an AI produces a copy of existing expression depends on the prompt. A user who encounters such a case can avoid liability by not using and not sharing the output, and otherwise the output might not substitute for the original expression for the user's purposes. So in most cases I think liability for the infringing outputs of an AI model should fall solely on the prompter. The liability that should fall on the developer of the AI model doesn't have to be binary. There can be heavier penalties on the developer for an AI that is more likely to reproduce C-3PO when given a vague prompt such as "space robot", lesser penalties on the developer for a model that only produces C-3PO when the prompt is at least as specific as "space war robot", and even lesser or no penalties for a model that only produces C-3PO from a prompt as specific as "golden space robot". The threshold for a vague prompt would vary; for a prompt such as "painting of melting clock" I would excuse a partial reproduction of Salvador Dalí's The Persistence of Memory.