You’re applying a double standard to LLMs and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.
as a human being, and one that does music stuff, i don’t download terabytes of other people’s works from the internet directly into my brain. i don’t have verbatim reproductions of people’s work sitting around on a hard disk in my stomach/lungs/head/feet.
LLMs are not humans. They’re essentially a probabilistic compression algorithm (encode data into model weights/decode with prompt to retrieve data).
Do you ever listen to music? Is your music ever influenced by the music that you listen to? How do you imagine that works, in an information-theoretical sense, that fundamentally differs from an LLM?
Depending on how much music you've listened to, you very well may have "downloaded terabytes" of it into your brain. Your argument is specious.
Information on how large language models are trained is not hard to come by; there are numerous articles that cover this material. Even a brief skim of it will make clear that the training of large language models is materially different in almost every way from how human beings "learn" and build knowledge. There are still many open questions around the process of how humans collect, store, retrieve and synthesize information.
There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data; the quality of output degrades greatly when novel input is provided. Is your argument that people fundamentally function in the same way? That would be a bold and novel assertion!
> There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data
If this were true, then you would be able to identify the specific work being "parroted" and you'd have a case for copyright infringement regardless of whether it was produced by an LLM at all. This isn't how LLMs work though. For instance, if an LLM's training data includes the complete works of a given author and then you prompt the LLM to write a story in the style of that author, it will actually write an original story instead of reproducing one of the stories in its training corpus. It won't be particularly good but it will be an original work.
It also isn't obvious whether or not, or to what degree, LLM training works differently from human learning. You yourself acknowledged that there are "many open questions" about how human learning works, so how can you be so confident that it's fundamentally different? It doesn't matter anyway because you can still apply the exact same standards to LLM output to judge whether it infringes copyright that you would to something that was produced by a human being.
some of that money that i pay to apple goes to the rights holders of that music for the copying and performance of their work through my speakers.
that’s a pretty big difference to how most LLMs are trained right there! i actually pay original creators some money.
-
i am a human being. you cannot reduce me down to some easy information theory.
an LLM is a tool. an algorithm. with the same random seed etc etc it will get the same results. it is not human.
you put me in the same room as yesterday i’ll behave completely differently.
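A toy sketch of the random-seed point above, not a real LLM: the token probabilities and the `sample_tokens` helper are made up for illustration. It only shows that seeded pseudo-random sampling from a fixed distribution repeats exactly.

```python
import random

# Hypothetical next-token probabilities; not taken from any real model.
NEXT_TOKEN_PROBS = {"the": 0.5, "a": 0.3, "my": 0.2}

def sample_tokens(seed, n=5):
    """Draw n tokens from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)

first = sample_tokens(seed=42)
second = sample_tokens(seed=42)
print(first == second)  # True: same seed, same draws
```

Whether a production model is this reproducible in practice depends on the serving setup; the sketch only covers the sampling step.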
-
i have listened to way more than terabytes of music in my life. doesn’t mean i have the ability to regurgitate any of it verbatim though. i’m crap at that stuff.
I don't see how this is a double standard. A person interacting with their culture is not comparable in any way to training an LLM. IMHO, it's kind of a wacky argument to make.
Can you elaborate on how it's not comparable? It seems obvious to me that it is -- they both learn and then create -- so what's the difference?
If I can hire an employee who draws on knowledge they learned from copyrighted textbooks, why can't I hire an AI which draws on knowledge it learned from copyrighted textbooks? What makes that argument "wacky" in your eyes?
It has never been argued that copyright law should apply to information that people learn, whether that be from reading books or newspapers, watching television or appreciating art like paintings or photographs.
Unlike a person, a large language model is a product built by a company and sold by a company. While I am not a lawyer, I believe many of the copyright arguments around LLM training revolve around the idea that copyrighted content should be licensed by the company training the LLM. In much the same way that people are not allowed to scrape the content of the New York Times website and then pass it off as their own content, so should OpenAI be barred from scraping the New York Times website to train ChatGPT and then sell the service without providing some dollars back to the New York Times.
You're not going to get an answer you find agreeable, because you're hoping for an answer that allows you to continue to treat the tool as chattel, without conferring to it the excess baggage of being an individuated entity/laborer.
You're either going to get: it's a technological, infinitely scalable process, and the training data should be considered what it is, which is intellectual property that should be being licensed before being used.
...or... It actually is the same as human learning, and it's time we started loading these things up with other baggage to be attached to persons if we're going to accept it's possible for a machine to learn like a human.
There isn't a reasonable middle ground due to the magnitude of social disruption a chattel quasi-human technological human replacement would cause.
No, you’re missing the point of copyright. The point of copyright is to protect an exclusive right to copy, not the right to produce original works influenced by previous works. If an LLM produces original works that are influenced by the training data, that is not a violation of copyright. If it reproduces the training data verbatim, it is.
One is a collection of highly dithered data generated by machines paid for by a business, in order to gain financially from the copyrighted works and to replace any future need for copyrighted textbooks.
The other is a person learning from a copyrighted textbook in the legally protected manner, which is the person and the use the textbook was written for.
I don't think this question really makes any sense... In my opinion, it's kind of mish-mashing several things together.
"Can you elaborate on how it's not comparable?"
The process of individual people interacting with their culture is a vastly different process than that used to train large language models. In what ways do you think these processes have anything in common?
"It seems obvious to me that it is -- they both learn and then create -- so what's the difference?"
This doesn't seem obvious to me (obviously)! Maybe you can argue that an LLM "learns" during training, but that ceases once training is complete. For sure, there are work-arounds that meet certain goals (RAG, fine-tuning); maybe your already vague definition of "learning" could be stretched to include these? Still, comparing this to how people learn is pretty far-fetched. AFAICT, there's no literature supporting the view that there's any commonality here; if you have some I would be very interested to read it. :-)
Do they both create? I suspect not; an LLM is parroting back data from its training set. We've seen many studies showing that tested LLMs perform poorly on novel problem sets. This article was posted just this week:
The jury is still out on the copyright issue; from the perspective of US law, we'll have to wait on this one. Still, it's clear that an LLM can't "create" in any meaningful way.
And so on and so forth. How is hiring an employee at all similar to subscribing to an OpenAI ChatGPT plan? Wacky indeed!
Obviously, on the inside, the process that a person goes through in learning and creating, and the process that an LLM goes through in learning and creating, are very different. Nobody will dispute that.
But if they're learning from the same kinds of materials, and producing the same kind of output, then obviously the comparison can be made. And your idea that LLMs don't create seems obviously false.
So I have to conclude the two seem comparable, and someone would have to show why different legal principles around copyright ought to apply, when it's a simple question of input/output. Why should it matter if it's a human or algorithm doing the processing, from a copyright perspective? Nothing "wacky" about the question at all.
Human creators don't store that 'influence' in a digital, machine-accessible format generated directly from the copyrighted content though.
Although with the 'good news everyone, we built the torment nexus' trajectory of AI, my guess is at this point AI companies would just incorporate actual human brains instead of digital storage if that was the requirement.
Does that imply that if we invent brain upload technology, my weights carry every conflicting license and patent for everything I can quote or create? I don't like that precedent. I have complete rights over my noggin's contents. If I do quote a NYT article in its entirety, that would be infringement, but copying my brain itself would not be.
Your argument boils down to “we don’t know how brains work”, and it is a non-sequitur. It isn’t a violation of copyright law to create original works under the creative influence of works still under copyright.