| HN Mirror


	by Polizeiposaune 534 days ago
	You wouldn't train an LLM on a corpus containing copyrighted works without ensuring you had the necessary rights to the works, would you?

5 comments

Workaccount2 534 days ago

LLMs are not massive archives of data. They are a tiny fraction of a fraction of a percent of the size of their training set.

And before you knee-jerk "it's a compression algo!", I invite you to archive all your data with an LLM's "compression algo".
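To put rough numbers on the size claim above, here is a back-of-the-envelope sketch. Every figure in it (parameter count, bytes per parameter, token count, bytes per token) is a hypothetical assumption chosen for illustration, not a measurement of any real model or dataset.

```python
def model_to_corpus_ratio(n_params, bytes_per_param, n_tokens, bytes_per_token):
    """Return model size divided by training-corpus size (both in bytes)."""
    model_bytes = n_params * bytes_per_param
    corpus_bytes = n_tokens * bytes_per_token
    return model_bytes / corpus_bytes

# Hypothetical figures: a 7-billion-parameter model stored at 2 bytes per
# parameter, trained on 2 trillion tokens averaging 4 bytes of text each.
ratio = model_to_corpus_ratio(
    n_params=7_000_000_000,
    bytes_per_param=2,
    n_tokens=2_000_000_000_000,
    bytes_per_token=4,
)
print(f"model is {ratio:.4%} of the corpus size")
```

Under these assumed numbers the weights come to well under one percent of the raw text; different assumptions will shift the ratio, so treat it as an order-of-magnitude illustration only.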

timewizard 534 days ago

> LLMs are not massive archives of data.

Neither am I, yet, I am still capable of reproducing copyrighted works to a level that most would describe as illegal.

> And before you knee-jerk "it's a compression algo!"

It's literally a fundamental part of the technology so I can't see how you call it a "knee jerk." It's lossy compression, the same way a JPEG might be, and simply recompressing your picture to a lower resolution does not at all obviate your copyright.

> I invite you to archive all your data with an LLM's "compression algo".

As long as we agree it is _my data_ and not yours.

Isamu 534 days ago

> It's lossy compression, the same way a JPEG might be

Compression yes, but this is co-mingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model they are essentially now overlapping.

The original document is represented statistically in the final model, but you’ve lost the ability to extract it closely. Instead you gain the ability to generate something statistically similar to a large number of original documents that are related or are structurally similar.

I’m just commenting, not disputing any argument about fair use.

BobbyTables2 534 days ago

Copying a single sentence verbatim from a 1000 page book is still plagiarism.

And is technically copyright infringement outside fair use exceptions.

concerndc1tizen 534 days ago

And similarly, translating those sentences into data points is still a derivative work, like transcribing music and then making a new recording is still derivative.

jpollock 534 days ago

derivative works still tend to be copyright violations.

concerndc1tizen 534 days ago

Yes, that's what I'm saying. An LLM washing machine doesn't get rid of the copyright.

int_19h 534 days ago

It doesn't matter. It's still a derived work.

baxtr 534 days ago

Well what isn’t in this world?

Would Einstein have been possible without Newton?

int_19h 532 days ago

I'm fine with us ditching copyright altogether.

But as things are, the megacorps are training their LLMs on the commons while asserting "intellectual property" rights on the resulting weights. So, fuck them, and cheers to those who try to do something about this state of affairs.

thedailymail 534 days ago

Newton was public domain by Einstein's time.

jampekka 534 days ago

Indeed. Copyright was introduced in 1710, Principia was published in 1687.

yieldcrv 534 days ago

and even with our current copyright laws providing for long-dated protection, it would still have been in the public domain

tomjen3 534 days ago

You wouldn't read a book and teach others its lessons without a derived license, would you?

ben_w 534 days ago

When I was at school, we were sometimes all sat down in front of a TV to watch some movie on VHS tape (it was the 90s).

At the start of the tape, there was a copyright notice forbidding the VHS tape from being played at, amongst other places, schools.

dijksterhuis 534 days ago

as an example: saying “i really like james holden’s inheritors album for the rough and dissonant sounds” isn’t covered by copyright.

if i reproduced it verbatim using my mouth, or created a derived work which is noticeably similar to the original, that’s a different question though.

in your example, a derivative work example could be akin to only quoting from the book for the audience and modifying a word of each quote.

“derived” works are always a grey area, especially around generative machine learning right now.

yieldcrv 534 days ago

and therefore everyone has the necessary rights to read works, the necessary rights to critique of the works including for commercial purposes, and the necessary rights to derivative works including for commercial purposes

deadbabe 534 days ago

Fair use.

dijksterhuis 534 days ago

*only available in the USA, terms and conditions apply.

most other places use fair dealing which is more restrictive https://en.m.wikipedia.org/wiki/Fair_dealing

griomnib 534 days ago

Easy to claim, harder to justify once you start charging money for your subsequent creation.

Unless all LLMs are a ruthless parody of human intelligence, which they may be, the legal issues will continue.

bayindirh 534 days ago

The moment you earn money from it, that's not fair use anymore. When I last checked, unlimited access to said models was not free, plus it's not "research" anymore.

- Addenda -

For the interested parties, the law states the following [0].

Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:

    1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
    2. the nature of the copyrighted work;
    3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
    4. the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors

So, if you say that these factors can be flexed depending on the defendant, and can be just waved away to protect the wealthy, then it becomes something else, but given these factors, and how damaging this "fair use" is, I can certainly say that training AI models with copyrighted corpus is not fair use in any way.

Of course at the end of the day, IANAL & IANAJ. However, my moral compass directly bars the use of a copyrighted corpus in publicly accessible, for-profit models which deprive many people of their livelihoods.

From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.

[0]: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors

FloorEgg 534 days ago

I really don't think it's that simple. I can read books and then earn money from applying what I learned in them. I can also study art and then make original art in the same or similar styles. If a person was doing this there would be no one claiming copyright infringement. The only difference is it's a machine doing it and not a person.

The nature of copyright and plagiarism boils down to paraphrasing, and so long as LLMs sufficiently paraphrase the content it's an open question whether it's copyright infringement and requires new law/precedent.

So the fact they are earning money is a red herring unless they are reproducing the exact same content without paraphrasing (with exception to commentary). E.g. they can quote part of a work while commenting on it.

Where they have gotten into trouble with e.g. NYT afaik is when the LLM reproduced a whole article word for word. I think they have all tried hard to prevent the LLM from ever doing that to avoid that legal risk.

bayindirh 534 days ago

> I can read books and then earn money from applying what I learned in them.

How many books can you read, understand and memorize in time T, and how many books can an AI ingest in that same time T?

If we're down to paraphrasing, watch this video [1], and think again.

Many models, given that you ask the correct questions, reproduce their training set with great accuracy, and this is only prevented with monkey patching, IIUC.

So, it's still a big mess, even if we don't add copyrighted corpus to the mix. Oh, BTW, datasets like "The Stack" are not clean as they claim. I have seen at least two non-permissively licensed code repositories inside that dataset.

[1]: https://youtu.be/LrkAORPiaEA

FloorEgg 533 days ago

I agree it's a big mess, that was kind of my point.

I am curious about the video, but am not compelled to spend 24 min watching it when you haven't summarized its thesis for me. The title of the video makes it seem adjacent at best to the points I was making. (Some automated flagging system =/= actual law)

o11c 534 days ago

"Making money" does not immediately invalidate fair use, but it does wave a big red flag in the courts' faces.

throwaway2037 533 days ago

I would be more nuanced on this matter. As I understand, in the US, fair use allows media to write critiques of cultural artefacts (sorry, I cannot think of a better, broad term). For example, you can include small quotes from the film script when writing a critique of it without requiring permission from the owner of the copyright. And, until the World Wide Web arrived to the masses in the mid-1990s, most critiques were published by commercial media outlets, such as a daily newspaper. They were certainly published by commercial, for-profit entities. That said, I think the intent of the fair use is very important to the courts, much more than the entity that is doing the fair use (newspaper, blogger, etc.).

Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.

iggldiggl 531 days ago

> Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.

Germany: https://www.gesetze-im-internet.de/urhg/__51a.html (This explicit carve-out is a recent development, though generally speaking parodies were allowed even under the previous version of the law.)

throwaway2037 531 days ago

Your reference (link) is very impressive. Thank you for sharing. Honestly, I would struggle to provide the equivalent for US federal law (or a court ruling). Are you a lawyer in DACH/Germany? How did you know to find this web page?

bayindirh 534 days ago

So you're saying that every law is a suggestion, depending on who's being tried?

o11c 534 days ago

Er, what? I'm speaking directly from the law, 17 U.S.C. § 107. It's deliberately written in terms of "factors to consider", rather than absolutes.

> In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:

> * the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

> * the nature of the copyrighted work;

> * the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

> * the effect of the use upon the potential market for or value of the copyrighted work.

xvector 534 days ago

You can absolutely monetize works altered under fair use.

bayindirh 534 days ago

Any examples sans current AI models? I have not seen any, or failed to find any, to be precise.

xvector 534 days ago

Basically any YouTube video that shows another YouTube video, song, movie, etc. as part of something else (eg a voiceover.)

philwelch 534 days ago

You’re applying a double standard to LLMs and human creators. Any human writer or artist or filmmaker or musician will be influenced by other people’s works, even while those works are still under copyright.

dijksterhuis 534 days ago

as a human being, and one that does music stuff, i don’t download terabytes of other peoples works from the internet directly into my brain. i don’t have verbatim reproductions of people’s work sitting around on a hard disk in my stomach/lungs/head/feet.

LLMs are not humans. They’re essentially a probabilistic compression algorithm (encode data into model weights/decode with prompt to retrieve data).

philwelch 534 days ago

Do you ever listen to music? Is your music ever influenced by the music that you listen to? How do you imagine that works, in an information-theoretical sense, that fundamentally differs from an LLM?

Depending on how much music you've listened to, you very well may have "downloaded terabytes" of it into your brain. Your argument is specious.

cmiles74 534 days ago

Information on how large language models are trained is not hard to come by, there are numerous articles that cover this material. Even a brief skimming of this material will make it clear that the training of large language models is materially different in almost every way from how human beings "learn" and build knowledge. There are still many open questions around the process of how humans collect, store, retrieve and synthesize information.

There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data, the quality of output degrades greatly when novel input is provided. Is your argument that people fundamentally function in the same way? That would be a bold and novel assertion!

philwelch 533 days ago

> There is little mystery to how large language models function and it's clear that their output is parroting back portions of their training data

If this were true, then you would be able to identify the specific work being "parroted" and you'd have a case for copyright infringement regardless of whether it was produced by an LLM at all. This isn't how LLMs work though. For instance, if an LLM's training data includes the complete works of a given author and then you prompt the LLM to write a story in the style of that author, it will actually write an original story instead of reproducing one of the stories in its training corpus. It won't be particularly good but it will be an original work.

It also isn't obvious whether or not, or to what degree, LLM training works differently from human learning. You yourself acknowledged that there are "many open questions" about how human learning works, so how can you be so confident that it's fundamentally different? It doesn't matter anyway because you can still apply the exact same standards to LLM output to judge whether it infringes copyright that you would to something that was produced by a human being.

dijksterhuis 534 days ago

i do listen to music.

i listen to it on apple music.

i pay money to apple for this.

some of that money that i pay to apple goes to the rights holders of that music for the copying and performance of their work through my speakers.

that’s a pretty big difference to how most LLMs are trained right there! i actually pay original creators some money.

i am a human being. you cannot reduce me down to some easy information theory.

an LLM is a tool. an algorithm. with the same random seed etc etc it will get the same results. it is not human.

you put me in the same room as yesterday i’ll behave completely differently.

i have listened to way more than terabytes of music in my life. doesn’t mean i have the ability to regurgitate any of it verbatim though. i’m crap at that stuff.

LLMs seem to be really good at it though.
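The "same random seed, same results" point above can be sketched with a toy sampler. This is not an LLM: the vocabulary, the weights, and the `sample_tokens` helper are all made up for illustration, and the only thing it shows is that seeded pseudo-random sampling is repeatable.

```python
import random

# Hypothetical toy "vocabulary" with fixed sampling weights (illustrative only).
VOCAB = ["the", "cat", "sat", "on", "mat"]
WEIGHTS = [5, 2, 2, 3, 1]

def sample_tokens(seed, n=8):
    """Draw n tokens from VOCAB using a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(VOCAB, weights=WEIGHTS, k=n)

run_a = sample_tokens(42)
run_b = sample_tokens(42)
print(run_a == run_b)  # same seed, same inputs: identical output
```

Real deployments add caveats this sketch ignores: the seed is often not fixed or exposed, and floating-point behaviour on parallel hardware may introduce small run-to-run differences.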

cmiles74 534 days ago

I don't see how this is a double standard. A person interacting with their culture is not comparable to training an LLM in any way. IMHO, it's kind of a wacky argument to make.

crazygringo 534 days ago

Can you elaborate on how it's not comparable? It seems obvious to me that it is -- they both learn and then create -- so what's the difference?

If I can hire an employee who draws on knowledge they learned from copyrighted textbooks, why can't I hire an AI which draws on knowledge it learned from copyrighted textbooks? What makes that argument "wacky" in your eyes?

tekno45 534 days ago

you're asking why you have to treat people differently than you treat tools and machines.

crazygringo 534 days ago

Well obviously not in general. But when it comes to copyright law specifically, yes absolutely. That is the question I'm asking.

cmiles74 534 days ago

It has never been argued that copyright law should apply to information the people learn, whether that be from reading books or newspapers, watching television or appreciating art like paintings or photographs.

Unlike a person, a large language model is a product built and sold by a company. While I am not a lawyer, I believe many of the copyright arguments around LLM training revolve around the idea that copyrighted content should be licensed by the company training the LLM. In much the same way that people are not allowed to scrape the content of the New York Times website and then pass it off as their own content, so should OpenAI be barred from scraping the New York Times website to train ChatGPT and then selling the service without providing some dollars back to the New York Times.

salawat 534 days ago

You're not going to get an answer you find agreeable, because you're hoping for an answer that allows you to continue to treat the tool as chattel, without conferring to it the excess baggage of being an individuated entity/laborer.

You're either going to get: it's a technological, infinitely scalable process, and the training data should be considered what it is, which is intellectual property that should be being licensed before being used.

...or... It actually is the same as human learning, and it's time we started loading these things up with other baggage to be attached to persons if we're going to accept it's possible for a machine to learn like a human.

There isn't a reasonable middle ground due to the magnitude of social disruption a chattel quasi-human technological human replacement would cause.

_DeadFred_ 534 days ago

One is a collection of highly dithered data generated by machines paid for by a business in order to financially gain from the copyrighted works in order to replace any future need for copyrighted text books.

The other is a person learning from a copyrighted textbook in the legally protected manner: the person, and the use, that the textbook was written for.

cmiles74 534 days ago

I don't think this question really makes any sense... In my opinion, it's kind of mish-mashing several things together.

"Can you elaborate on how it's not comparable?"

The process of individual people interacting with their culture is vastly different from the process used to train large language models. In what ways do you think these processes have anything in common?

"It seems obvious to me that it is -- they both learn and then create -- so what's the difference?"

This doesn't seem obvious to me (obviously)! Maybe you can argue that an LLM "learns" during training, but that ceases once training is complete. For sure, there are work-arounds that meet certain goals (RAG, fine-tuning); maybe your already vague definition of "learning" could be stretched to include these? Still, comparing this to how people learn is pretty far-fetched. AFAICT, there's no literature supporting the view that there's any commonality here; if you have some I would be very interested to read it. :-)

Do they both create? I suspect not; an LLM is parroting back data from its training set. We've seen many studies showing that tested LLMs perform poorly on novel problem sets. This article was posted just this week:

https://news.ycombinator.com/item?id=42565606

The jury is still out on the copyright issue; from the perspective of US law we'll have to wait on this one. Still, it's clear that an LLM can't "create" in any meaningful way.

And so on and so forth. How is hiring an employee at all similar to subscribing to an OpenAI ChatGPT plan? Wacky indeed!

crazygringo 533 days ago

Obviously, on the inside, the process that a person goes through in learning and creating, and the process that an LLM goes through in learning and creating, are very different. Nobody will dispute that.

But if they're learning from the same kinds of materials, and producing the same kind of output, then obviously the comparison can be made. And your idea that LLMs don't create seems obviously false.

So I have to conclude the two seem comparable, and someone would have to show why different legal principles around copyright ought to apply, when it's a simple question of input/output. Why should it matter if it's a human or algorithm doing the processing, from a copyright perspective? Nothing "wacky" about the question at all.

groby_b 534 days ago

Unless you are making an argument for personhood, one is a machine, the other is a human. Different standards apply, end of discussion.

homarp 534 days ago

most probably your employee actually 'paid' for their textbook.

rob_c 534 days ago

That's a little simplistic. You're almost trying to say black and white, sans gray, can't be compared, which is a bit weird.

Strangely like the situation itself.

The question just comes down to: how can we guarantee a model is influenced by an input rather than memorising it?

And then is a human who is influenced simply relying on a faulty or less than perfect memory?

_DeadFred_ 534 days ago

Human creators don't store that 'influence' in a digital machine accessible format generated directly from the copyrighted content though.

Although with the 'good news everyone, we built the torment nexus' trajectory of AI my guess is at this point AI companies would just incorporate actual human brains instead of digital storage if that was the requirement.

galangalalgol 534 days ago

Does that imply that if we invent brain upload technology, my weights carry every conflicting license and patent for everything I can quote or create? I don't like that precedent. I have complete rights over my noggin's contents. If I do quote a NYT article in its entirety, that would be infringement, but copying my brain itself would not.

philwelch 534 days ago

Your argument boils down to “we don’t know how brains work”, and it is a non-sequitur. It isn’t a violation of copyright law to create original works under the creative influence of works still under copyright.
