
by MattDaEskimo 897 days ago

Your comparison to the Amazon mp3s doesn't match the current, unprecedented situation at all.

The obvious issue is that copyrighted material was used without permission, and an opt-out feature was introduced without any way to remove already used training data.

If the copyrighted data that OpenAI is copying is causing financial harm to the rightful owner, there are certainly grounds for copyright infringement. This is not fair use.

It is not fair to try and draw metaphors.

2 comments

deeviant 897 days ago

> The obvious issue is that copyrighted material was used without permission

I think you are also dancing around the realities of the situation, to your benefit. What does "used" mean here? If I listen to Enter Sandman by Metallica and learn you can make a cool song in Em using the natural minor scale and the blues scale, then write my own unique rock song in Em using blues scales, is that "using" Metallica's song without permission?

The hidden part of all of these copyright claims against LLMs is that they want new rights, expanded rights. They want rights over learning from their material, a right that has never existed before. Sure, they try to frame it by pointing at regurgitation cases and screaming "they are stealing our work!", but then they absolutely do not stop at the request that the model not regurgitate their material; they go on to claim that they own all derivative thought resulting from their works.

Let's take this to the extreme: say we get to sentient AI. Does that mean copyright owners own the actions of all sentient AI beings, because they learn from their copyrighted material? Do you have two systems of laws that try to dictate how an artificial but sentient being can learn, and how a biological one can? Putting aside the question of the possibility and timeline of sentient artificial beings, the legal ramifications devolve into absurdity immediately if you start with what copyright holders are asking for now.


NemoNobody 895 days ago

Yes! Thank you


Arnt 897 days ago

We use copyrighted materials all the time without permission. You and I both read the Verge article without the Verge's permission. Reading is the intended and most common use of the Verge's articles, and neither of us asked for permission. I didn't print that one but I often do print, always without asking anyone's permission.

You can argue that training is a kind of copying, since it involves copying of things from RAM to RAM, etc. I find that difficult, since we've established that e.g. this browser's copying of web page contents from RAM to RAM isn't.

If you don't argue that training is copying, then you can argue that since training is a necessary prelude to copying, it should be treated like copying legally. I disagree, because various kinds of fair use also have the training as a necessary prelude (and, uh, the purchase I mentioned could also be a necessary prelude to copying, if my goal was to copy the album).


MattDaEskimo 897 days ago

This is apples to oranges.

As a human, you don't need permission to read content if it's freely available. It was explicitly made for us to consume, with an expectation of returning value, mainly through advertising.

Laws are (in theory) put in place to ensure a fair playing ground. If Company B requires content from Company A, but is causing financial damage to Company A by using it, this is not fair use.

I'd also like to add the intentions. Even if we decide to quote the paper later in the day, I would say it's fair to assume 95% of readers do not intend to copy the content for profit.

OpenAI on the other hand is explicitly intending to copy the material to reuse into its own content in millions of generations for profit.

No metaphors needed.


Arnt 897 days ago

Could you please explain (without metaphors!) why the publishers who publish 20-page summaries of books do so legally, while GPT's reuse into its content violates copyright?


vel0city 897 days ago

A 20-page summary of a book is a substantially different creation. They likely don't even have entire paragraphs reproduced, maybe only a few quotes. Those summaries also have deeper introspection on the overall work along with potentially critique about the work. It is a different creation even though related to another.

Exactly reproducing most of an article is vastly different from a short summary. ChatGPT was exactly reproducing large portions of articles, or entire articles. If ChatGPT was only writing short summaries of articles or critiques about them this case would be radically different. But in the end, ChatGPT is exactly reproducing copyrighted works.


dangus 897 days ago

Because those are summaries.

The NYT found substantial portions of exact copies of their content being reproduced when you give ChatGPT the right prompts.


MattDaEskimo 897 days ago

This is still apples to oranges, but I'll bite.

A summarized book can still entice a potential reader to purchase it. It's a form of advertising.

Chewing up content and spinning it without any citations does not provide the original owners any form of publicity.


dangus 897 days ago

Viewing copyrighted materials is basically never the problem. When we read The Verge we aren’t redistributing any of their content. We are doing exactly what they grant us a limited license to do: read the articles and use them for non-commercial purposes.

See section 14 of The Verge’s terms of use as an example: https://www.voxmedia.com/legal/terms-of-use

Fair use can only override some of this: for example, I can use content from The Verge if I’m using a limited percentage of it for the purpose of critique and discussion. This application of fair use is basically the for-profit business model of YouTube channels like WatchMojo, which use small clips of movies and TV along with commentary and critique. Without that commentary or without limiting their redistribution to small portions of the work, they would be breaking copyright law.

The problem is redistribution of a substantial portion of the work. The NYT has allegedly found some very damning instances where ChatGPT provided answers containing substantial, almost unchanged portions of text directly copied from NYT articles. NYT never granted ChatGPT any license to redistribute their content for commercial purposes, and it doesn’t seem like ChatGPT is doing anything covered by fair use (such as providing discussion or commentary).

I’m not aware of any part of copyright law that gives an infringing party a pass just because someone pressured them into infringement.


mint2 897 days ago

Your main example is completely wrong.

Websites like The Verge have terms of use that one is legally obliged to follow to have permission to view and use their site.
