by 0xcde4c3db 572 days ago
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?

I read the headline as the copyright violation claim being core to the lawsuit.
The plaintiffs focused on exactly this part, the removal of metadata, probably because it's the most likely to hold up in court. One judge remarked on it pretty explicitly, saying that it's just a proxy for the real issue: the use of copyrighted material in model training.

I.e., it's some legalese trick, but "everyone knows" what's really at stake.

Yeah; I think that's essentially where the disconnect is rooted for me. It seems to me (a non-lawyer, to be clear) that it's damn hard to make the case for model training necessarily being meat-and-potatoes "infringement" as things are defined in Title 17 Chapter 1. I see it as firmly in the grey area between "a mere change of physical medium or deterministic mathematical transformation clearly isn't a defense against infringement on its own" and "giant toke come on, man, Terry Brooks was obviously just ripping off Tolkien". There might be a tension between what constitutes "substantial similarity" through analog and digital lenses, especially as the question pertains to those who actually distribute weights.
I think you're at the heart of it, and you've humorously framed the grey area here; it's very weird. Sans a ruling that, for example, computers are too deterministic to be creative, copyright laws really seem to imply that LLM training is legal. Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here? A ruling declaring this copyright infringement is likely going to have crazy ripple effects going way beyond LLMs, something a good judge is going to be very mindful of.

Ultimately, this is probably going to require Congress to create new laws to codify this.

> Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here?

The legal argument is that copying, or creating what would otherwise be derivative works, solely within a human brain is exempt, because the human brain is not a medium wherein a configuration of information constitutes either a copy or a new work until it is set in another medium or performed publicly, whereas the storage of an artificial computer is absolutely such a medium (both of which are well-established law). So the “learning” metaphor is not legally valid, even if it is arguably a decent metaphor for some other purpose. Furthermore, learning and then creating something new is often illegal, if the “something new” has sufficient proximity to the source material (that's the prohibition on unlicensed derivative works). GenAI systems often do that, and are (so the argument goes) used to do so sufficiently frequently and intentionally, with the knowledge of the service and model providers, that, even were the training itself not a violation, the standards for contributory infringement are met in the provision of certain models and/or services.

According to US law, is the Internet Archive a library? I know they received a DMCA exemption.

If so, you could argue that your local library returns perfect copies of copyrighted works too. IMO it's somehow different from a business turning the results of its scraping into a profit machine.

My understanding is that there is no concept of a library license: you just say you're a library and therefore become one. Whether your claim survives is more a product of social and cultural acceptance than of actual legal structures, but someone is welcome to correct me.

The Internet Archive also scrapes the web for content and does not pay authors. The difference is that it spits out literal copies of the content it scraped, whereas an LLM fundamentally attempts to derive a new thing from the knowledge it obtains.

I just can't figure out how to plug this into copyright law. It feels like a new thing.

“Core copyright violation”, here, I think is being used relative to the claims in the case.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.

If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAI's computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.

Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.

“If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.”

That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.

That's what Adobe and Getty Images are doing with their image generation models; both are exclusively using their own licensed stock image libraries, so they (and their users) are on pretty safe ground.
That’s good. I hope more do. This list has those doing it under the Fairly Trained banner:

https://www.fairlytrained.org/certified-models

The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet, you won’t see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.

We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.

That is OpenAI's problem, not their victims'.
> he can order all of OpenAI's computer equipment be impounded.

Arrrrr matey, this is going to be fun.

People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get GitHub to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.

GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR).

When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
Their stance is that GitHub revoked that license by blocking their account.
GitHub's terms of service specify the license is granted as necessary to provide the service. Since the service is not provided they don't have a license.
It won't happen. Judges only order that punishment for the little guys.
There would be a highly embarrassing walking back of such a ruling, when Sam Altman flexes his political network and effectively overrules it.

He spends his time amassing power and is well positioned to plow over a speed bump like that.

Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?

It seems to me that it shouldn't really affect model quality all that much, should it?

Also, in the amended complaint:

> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights

Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?

In the decision:

> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.

> Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?

Have you read 1202? It's all about hiding your infringement.