Hacker News
More writers sue OpenAI for copyright infringement over AI training (reuters.com)
54 points by kurhan 1010 days ago
9 comments

Random thought: my blog is licensed under a Creative Commons license [1] that allows you to use and transform my content as long as you give attribution and distribute your contributions under the same terms.

I found the OpenAI bot scraping my blog recently. Assuming they used that data, when will they attribute me?

[1] https://creativecommons.org/licenses/by-sa/4.0/

Because these AI companies don't comply with the licenses on code, I haven't contributed a single line of open source software or released any of my projects that way since Microsoft released their code generator. I removed a bunch a while ago and I will likely remove all of them when I get around to it. I have been fixing bugs and releasing open source projects for decades and I just stopped the moment they did that. Open source is dead to me if the licenses can't be enforced.
Your license doesn't override copyright law.

Given that Google successfully used a fair use defense in Authors Guild, Inc. v. Google, Inc., I think it's likely OpenAI and the others will also win in court.

I do think it's possible for specific uses of the output of LLMs to be copyright infringement. That's why it's interesting to see Microsoft indemnify customers of their commercial products in the event a case is brought against the customer. This is smart on Microsoft's part; the risk probably isn't very high and by making it a non-issue for their customers, many more will feel comfortable using their LLM-based features and services.

Well it all comes down to whether training an LLM is fair use or not. I think it is likely that courts rule it is transformative enough that training is allowed regardless of what terms you have for the use of content.
Interesting question, continuing on this, since they probably used GPL-3 code with the Affero clause, do they have to open source GPT? (The Affero clause is I believe the more directly applicable license thingy, though CC by-sa should also work.)

https://www.gnu.org/licenses/agpl-3.0.en.html

I think the whole code license question does not matter much, because the code is data input, not a part of their actual program.

Like GitHub's servers host AGPL code as data, without having to be open-source.

The perceived problem there is if their model generates an exact copy of some AGPL code, you use it in your project unknowingly, and then you can get sued.

How recent? Because ChatGPT is always on the same mantra, of its training being from back in September 2021 with no updates... even for ChatGPT-4.
Where are your attributions for all the words written in your comment ;P You remixed the words and grammar patterns from other people's Creative Commons-licensed writings!

Note: I'm declaring my comment license as https://creativecommons.org/licenses/by-sa/4.0/

So if you remix or transform my comment by responding to it, please attribute your response to me.

Humans are not LLMs trained and operated by a company for profit. Your argument is that LLMs hold all the same basic rights as humans but they hold (and should hold) exactly none.
That might be the implied argument, but the explicit argument appears to be that licensing a grouping of words as a work, and then declaring any use of any of those words or letters in any order to be a transformation of that work without any other context or evidence of transformation, is silly. We could rephrase the OP's post as:

>I found [logs of users from Paramount's writers offices reading] my blog recently. Assuming they used that data, when will they attribute me?

That shows the idea is silly on its face. OP has no evidence that any of their work was used at all, or even that what was used could be covered under the license in the first place.

Which works, specifically?

profit vs non profit also makes a difference

I believe that OpenAI is not required to attribute you if the output was produced by an OpenAI-operated AI model because the AI is not constrained by the Berne Convention treaty regime in the same way that people are.

I believe that this fact is and will be exploited to strip copyright and effectively transfer ownership using cleanroom/firewall techniques.

training will be ruled fair use which doesn't require any license, while there is no lawsuit on the output
> Assuming they used that data...

That's the key part. You haven't yet proved they have actually used your content for anything (other than, potentially, read the license to decide if they should include or discard from their training set).

But in practice we'll never know for sure if they are respecting the terms of licenses until 1) this is tested in court, or 2) there's some internal leak that points in either direction.

I expect that OpenAI would concede that they used the data in any court case immediately to get that issue off the table, I really don't think they have a strong interest in foot-dragging on this stuff, right?

I would think OpenAI wants the thornier legal issues actually settled so that the whole ecosystem can grow within those terms & they can lobby for the legal changes they need/want?

The alternative would be discovery on that issue, which they may want to avoid.
> wants the thornier legal issues actually settled

.. wants the thornier issues to be debated and re-tried ad infinitum, as long as they generate cash flow and build their moat(s).. more likely

https://the-decoder.com/openai-apparently-going-all-in-on-ch...

This behaviour seems more consistent with wanting it sorted out than stalling for time.

a motion to dismiss is "going all in" ?
All these lawsuits will die. Why?

Because people train on corpuses of data all the time, without a license or any attribution.

Every piece of text a writer reads is training that writer. Every image an artist sees helps to train that artist. Every sound a musician hears is training that musician.

That doesn't mean they can't exclude their works from training via a license going forward. But that becomes an enforcement problem.

IIRC courts have already ruled AI-generated works cannot have copyright. So there is already a legal distinction between a human and a model creating works.

I also doubt "humans are just a larger Markov chain than the LLM and they're allowed to" will hold up in court.

I don’t see what eligibility to have works protected has to do with legality of learning.

I really hope “copyright can be used to prohibit reading and learning” does not hold up in court.

Copyright is, and should be, a protection from unauthorized reproduction. Extending it to protect the abstract ideas would be a disaster. And extending it to control stylistic learning would be even worse.

You are not understanding, and making a lot of assumptions isn't a substitute for that.
Hard to argue with that level of reasoning and sourcing.
It's a good thing humans and computers are 2 wholly separate categories of things that have 0 things related to them other than computers being anthropomorphized by AI sycophants!
People can do a lot of things that we don't legally allow machines or automation to do.
True of some things, but not of Fair Use. Automatically generating thumbnails is generally Fair Use, for example.
Like what? The only things I can think of relate to quality/safety (eg. drivers or lawyers).
Participate politically, seek employment, and of course the examples you gave.

When public safety and goodwill come into focus, that's where the role of automation is scrutinized and minimized more heavily. Copyright itself is an invention and area of balancing individual rights and greater public good.

Machines are not human and they are not sentient and sapient at a level where we can view them differently. Perhaps they will change one day, but as it is today these systems are not entitled to do the same things humans get to do. They are tools performing a task, so the laws apply to them as they apply to, well, machines; copying and reproducing whole code blocks or novel chapters without attribution or licenses is something we allow a human to do in their head and not what we allow a machine to do in a prompt, regardless of the non-human mechanisms in between.

Human or not doesn't seem relevant in any jurisdiction which emphasizes copyright as a means to the end of "promoting useful art", as the US Constitution does, rather than an end in itself.

The moral calculus probably changes if machines are deemed capable of producing "useful art", as granting artists temporary monopoly ceases to become the only mechanism of spurring that art.

LLMs are not humans and your anthropomorphizing argument is idiotic
Do CliffsNotes of books and plays infringe/need a license? If so, that seems like they’d have a possible case. Maybe. If not, well… maybe OpenAI infringed by not buying their original copy, but not sure that feeding it into a bunch of math is going to be copyright infringement.
> the system can accurately summarize their works and generate text that mimics their styles

CliffsNotes are not what lets you replicate the style of the author etc.

And yeah, you can use "it's just feeding it into a bunch of math" to justify nearly anything that involves software including good old piracy. What matters is what the math is used for. (Spoiler: to line Microsoft's pockets at the expense of actual writers in this case.)

Not at all like piracy.

When someone pirates a book, they're replacing the original without consent or remuneration to the copyright holders.

When you train an AI on the contents of a book, you're not replacing it. If someone is interested in the content, they still need to buy it. Using ChatGPT is not a substitute. If it is, they're gonna have to prove it in court, but I doubt they'll be able to.

If you can ask ChatGPT about any book contents, you don't need to get the book, and if you don't need to get the book then the author got robbed, ClosedAI/MS profited.
The ability to ask people what is contained within a book isn't obviously copyright infringement.

Merely summarizing info and attributing it to the source is the basic element of learning, for both machines and human beings.

These suits are necessary because it's not clear where the line is, and if ChatGPT's functions actually cross it.

What is clear is that OpenAI is doing its best to avoid infringing anyone's copyright even if it is trivial for them to do so. They have the training data, so they could simply output it word for word, bypassing the LLM. They don't do that, and they further restrain their LLM from making overly long recitations.

If you can trick / manipulate the LLM into giving you too much then I say that infringement is on you.

> The ability to ask people what is contained within a book isn't obviously copyright infringement.

The ability to ask a commercial product is. In fact, feeding the book to that commercial product is already infringement.

ClosedAI is doing squat. The very least they could do is ask authors for permission, and of course if they really cared they would have LLM infer attribution and revenue share with the original creators.

I think it's important to distinguish between content and presentation. Most books don't offer entirely new content, but (at best) give some novel way of presenting old content. Consider a modern retelling of Greek Mythology. The stories weren't the original contribution by the author (but by the Ancient Greeks), but the particular way they tell it may be. So ChatGPT telling people about its "content" is unproblematic if it's just telling people how the story goes, and only potentially problematic if it's effectively quoting from the book or mimicking its presentation. (And we all know that if ChatGPT is good at one thing, it's paraphrasing or re-expressing the same ideas in substantially different ways, so even if ChatGPT literally copies a book's presentation/wording, that would probably have happened by accident rather than necessity.)

The vast majority of publications (especially those of an explanatory nature) do not contribute original content/information. The exceptions are things like research articles/monographs, historical records, government reports. But copyright infringement doesn't apply here because these things weren't published with a profit motive but precisely to publicize the information as widely as possible. The only problem area I can think of involves books published by commercial publishers which promise an 'exclusive peek' into the life of some famous person (think biographies of celebrities or books like Fire and Fury). In that kind of case there is indeed original content, and revealing it in detail will arguably mean fewer sales for the authors/publishers.

it appears from your emphasis that you are arguing generally that "originality" and personal authorship are rare in practice, and therefore imply that mixing in training is "mostly not infringement"

I disagree with this emphasis, given that rote, repetitive or technical material that is not original authorship is not in peril. Human authors who wrote original creative content, or wrote in a style that is personal and widely recognized, their rights to trade and commerce are in peril. That is much more important over the long term, and is not worth losing for convenient information mixers.

I can also go to the Wikipedia article about any book of note and get a plot or other summary and other information about the book. If that’s the reason for buying the book, “the author got robbed.”
If you ask a human questions about a book, thereby avoiding having to buy the book for class, did you rob the author?
> If you ask a human questions about a book, thereby avoiding having to buy the book for class, did you rob the author?

If someone makes a commercial activity of "answering any question about book contents at any time 24/7", hires tons of people to read those books and reply to billions of such questions daily thereby helping everyone not buy any books, is that robbing book authors?

Food for thought.

is a sale forced or coerced, also comes to mind. Tales of college undergrads forced to buy hundreds of dollars worth of books for single semester come to mind...

but let's be direct - are we talking about market share in the millions of views, where pirate copies are also available, or the sale of any books at all compared to a few hundred over a year. Quite the difference on a subsistence level of an individual author, no?

What sort of questions? How would you know what to ask it, unless maybe you have another source for the book?

Curiously, when I ask GPT-4 about some well-known but under-copyright book, it says it can't answer because of the copyright. For well-known books out of copyright such as Alice in Wonderland, it can recite passages but tends to get lost and start reciting another section or book at some point. Would be real frustrating to use as a substitute.

This reminds me of the tenuous RIAA claim that every pirated piece of media represented a lost sale back when they were suing their customers in the 2000s.
It's been something of a wild ride for me having lived through the "Information wants to be free" era to now live in the new "Reading my publicly published writings and deriving new things from that is theft" era. The next few years of court battles around this are going to be interesting, and I'm not too hopeful on the odds that the "little guy" wins in the end. Seemingly "little guy" affirming results might just turn around and further entrench large players instead.
Are copyright holders being robbed by professors that answer their students' questions?

Don't teachers do the same?

- Trained their minds on existing books

- Tutor the next generation of students

- Give classes on book contents

- Answer questions about those books

The book publishing industry didn't go out of business because there are teachers answering questions. To the contrary, it benefited book sales, because most people aren't good self-learners.

What's wrong with having a machine do the same?

> Don't teachers do the same?

> - Trained their minds on existing books

Training a human = enriching conscious human mind. "Training" AI = mechanically creating a derivative work (no conscious mind to enrich). Training a human is to "training" AI as killing a human is to "killing" a Unix process: same word, different things.

Copyright doesn't protect style or genre [1], so these suits seem destined to fail. That said, it seems like it is time to reexamine those laws in light of current technology before it kills off creative works.

[1] https://creativecommons.org/2023/03/23/the-complex-world-of-....

The fair use argument is quite strong

If you dissect the plaintiffs' claim, they are arbitrarily conflating training and regurgitating

Training is using for criticism and comparison purposes, hence fair use

And there is no lawsuit against what it regurgitates and the purpose of its output, whether someone asks it to give a list for comparison purposes, or specifically asks it for a story that has a plagiarized result

I'm miffed. I tried a couple characters from my books, and zilch:

=====

who is dan markunas

ChatGPT: I'm sorry, but I don't have any information on a person named Dan Markunas in my database ....

who is janet saunders

ChatGPT: I'm sorry, but I don't have any specific information about a person named Janet Saunders in my database,

=====

Your book was published somewhere mid-2021, right?
Two books. The second was after their cutoff date.
this is a great lawsuit! if you read the complaint, they catch OpenAI dead-to-rights .. asking about plot details with names from the books, asking to write a paragraph in the style of the author in that book, and a diversity of authors that shows social awareness.. great support for this from California
If you go to wikipedia and look up a book, you'll likely find plenty of plot details, including character names. Is this also infringing?

As far as style goes, copyright doesn't protect that. Trademark MIGHT if your style is distinctive enough to be a trademark (and is used as such), but the "style" of a writer is largely about tempo and word choices, none of which are subject to copyright protections.

I think we are now reproducing multiple generations of debate on this topic, in a few go-rounds.. Let's note that among the four largest economies in the world, they each have different rules for this.
Do any of those economies really have laws protecting an author's "style"? Because I'd really like to see a legal definition of an author's style, and a case that found someone guilty of infringing on that style (separate from trademark and copyright of course)
I'm waiting for the day OpenAI sues humans for infringement of its prompt output
It's akin to suing a person for memorizing things from a book. Don't complain, go write something.
Is there a list of the critters suing AI companies so I can boycott them?
Why would you want to do that?
AI is the next industrial revolution. It will greatly increase human productivity. Anyone standing against it is an enemy of all mankind.