by Baldbvrhunter 897 days ago
> or cherry-picked their examples from many attempts

even if that were true, is it really a defence?

> It said AI tools have to incorporate copyrighted works to “represent the full diversity and breadth of human intelligence and experience.”

that sounds like an appeal "our product isn't much use if we can't violate copyright"


Yes, that's a defense. Twice over.

I bought an MP3 album from Amazon last weekend. One of the many things I got from that purchase was the ability to copy that album, which would be a copyright violation. That doesn't make the purchase unjustifiable, immoral or illegal — my actual use for the album justifies the purchase. The possible copyright violation is irrelevant.

People will try to trick you with statements that mention something bad and omit everything good. Don't let them. Think about what's omitted. Does ChatGPT get anything good, legal, useful from reading NYT? I'd say it does. For example, it gets the knowledge necessary to explain things in three paragraphs, partly based on NYT articles. And partly based on Wikipedia, which in turn is based on the NYT.

OpenAI is saying that training to provide a three-paragraph summary of recent events is fair use of newspapers, and that such training is not realistically possible without copyrighted materials. It's saying that if you make copyright violations impossible instead of difficult, then you can't use the articles fairly either. Sounds persuasive to me.

There's a second aspect, less important IMO: de minimis non curat lex. "The law does not concern itself with trifles" basically. If OpenAI made it really difficult to make GPT do a certain thing, if you have to try many times and it's not even clear whether each attempt succeeded, then the possibility of doing that thing isn't a matter of law, says that principle.

It doesn't matter whether you got the ability to copy that album. It only matters whether you adhere to the licensed rights that you got when buying the album, which allows copying for your personal use. If you decide to copy the album to random other people in exchange for $10, then that's no longer covered by the license and thus illegal, even though you clearly got the ability to do that by buying the album and thus getting access to the MP3s in the first place.

The NYT does not give readers the license to recite substantial portions of their articles verbatim on their own websites, even though buying paid access to the NYT website technically gives them the ability to do so. In the same manner, OpenAI did technically gain the ability to do the same, but has not acquired the right to do so. The funny fact that you have to enter some magic words into a form on their website in order for their site to regurgitate entire article texts does not change anything; if I did the same on my website, for example via a form that requires entering the first 100 words verbatim before spitting out the entire text, it would obviously still be illegal. The same goes for the fact that you have to make several attempts before one of them succeeds in reproducing the entire text correctly; I could replicate that just as well by adding an RNG that only returns the valid text in 20% of cases, and my website would still clearly be illegal.

Your browser displays a verbatim copy. We don't hold browsers responsible for any copyright violation committed by their users, even though the browser makes a local cache of the content and supports tools that strip ads from that content. We hold the users of the tool responsible, not the tool itself. ChatGPT is a tool, not a media platform.

> The funny fact that you have to enter some magic words into a form on their website in order for their site to regurgitate entire article texts does not change anything;

Yes it does. The degree to which a tool is designed to facilitate or prevent piracy definitely is going to impact fair use arguments.

Under your rudimentary explanation of the issue, the Internet Archive would not be allowed to store and serve reproductions of articles.

UMG Recordings, Inc. v. MP3.com, Inc. was decided April 28, 2000

https://en.wikipedia.org/wiki/UMG_v._MP3.com

Users could insert an audio CD into their drive, upload the ID of it to my.MP3.com and listen to it on demand.

It was ruled in favor of the record labels against MP3.com and the service on the copyright law provision of "making mechanical copies for commercial use without permission from the copyright owner." Before damages were awarded, MP3.com settled with the plaintiff, UMG Recordings, for $53.4 million, in exchange for the latter's permission to use its entire music collection.

Your comparison to the Amazon mp3s doesn't match at all the current, unprecedented situation.

The obvious issue is that copyrighted material was used without permission, and an opt-out feature was introduced without any way to remove already used training data.

If the copyrighted data that OpenAI is copying is causing financial harm to the rightful owner, there are certainly grounds for copyright infringement. This is not fair use.

It is not fair to try and draw metaphors.

> The obvious issue is that copyrighted material was used without permission

I think you are also dancing around the realities of the situation, to your benefit. What does "used" mean here? If I listen to Enter Sandman by Metallica and learn you can make a cool song in Em using the natural minor scale and the blues scale, then write my own unique rock song in Em using blues scales, is that "using" Metallica's song without permission?

The hidden part of all of these copyright claims against LLMs is that they want new rights, expanded rights. They want rights to learn from their material, a right that has never existed before. Sure, they try to frame it by pointing at regurgitation cases and scream "they are stealing our work!", but then they absolutely do not stop at the request that the model not regurgitate their material, but that they own all derivative thought resultant from their works.

Let's take this to the extreme: let's say we get to sentient AI. Does that mean copyright owners own the actions of all sentient AI beings, because they learn from their copyrighted material? Do you have two systems of laws that try to dictate how an artificial but sentient being can learn, and how a biological one can? Putting aside the question of the possibility and timeline of sentient artificial beings, the legal ramifications devolve into absurdity immediately if you start with what copyright holders are asking for now.

Yes! Thank you

We use copyrighted materials all the time without permission. You and I both read the Verge article without the Verge's permission. Reading is the intended and most common use of the Verge's articles, and neither of us asked for permission. I didn't print that one but I often do print, always without asking anyone's permission.

Copyright has that name because copying is exceptionally protected; general use is not.

You can argue that training is a kind of copying, since it involves copying of things from RAM to RAM, etc. I find that difficult, since we've established that e.g. this browser's copying of web page contents from RAM to RAM isn't.

If you don't argue that training is copying, then you can argue that since training is a necessary prelude to copying, it should be treated like copying legally. I disagree, because various kinds of fair use also have training as a necessary prelude (and, uh, the purchase I mentioned could also be a necessary prelude to copying, if my goal was to copy the album).

This is apples to oranges.

You don't need permission as a human, to read content if it's freely available. It was explicitly made for us to consume with an expectation of returning value. Mainly advertising.

Laws are (in theory) put in place to ensure a fair playing ground. If Company B requires content from Company A, but is causing financial damage to company A by using it, this is not fair use.

I'd also like to add the question of intent. Even if we decide to quote the paper later in the day, I would say it's fair to assume 95% of readers do not intend to copy the content for profit.

OpenAI on the other hand is explicitly intending to copy the material to reuse into its own content in millions of generations for profit.

No metaphors needed.

Could you please explain (without metaphors!) why the publishers who publish 20-page summaries of books do so legally, while GPT's reuse into its content violates copyright?

A 20-page summary of a book is a substantially different creation. They likely don't even have entire paragraphs reproduced, maybe only a few quotes. Those summaries also have deeper introspection on the overall work along with potentially critique about the work. It is a different creation even though related to another.

Exactly reproducing most of an article is vastly different from a short summary. ChatGPT was exactly reproducing large portions of articles, or entire articles. If ChatGPT was only writing short summaries of articles or critiques about them this case would be radically different. But in the end, ChatGPT is exactly reproducing copyrighted works.

Because those are summaries.

The NYT found substantial portions of exact copies of their content being reproduced when you give ChatGPT the right prompts.

This is still apples to oranges, but I'll bite.

A summarized book can still entice a potential reader to purchase it. It's a form of advertising.

Chewing up content and spinning it without any citations does not provide the original owners any form of publicity.

Viewing copyrighted materials is basically never the problem. When we read The Verge we aren’t redistributing any of their content. We are doing exactly what they are granting a limited license to do: read the articles and use them for non-commercial purposes.

See section 14 of The Verge’s terms of use as an example: https://www.voxmedia.com/legal/terms-of-use

Fair use can only override some of this: for example, I can use content from The Verge if I’m using a limited percentage of it for the purpose of critique and discussion. This application of fair use is basically the for-profit business model of YouTube channels like WatchMojo, which use small clips of movies and TV along with commentary and critique. Without that commentary or without limiting their redistribution to small portions of the work, they would be breaking copyright law.

The problem is redistribution of a substantial portion of the work. The NYT has allegedly found some very damning instances where ChatGPT provided answers containing substantial, almost unchanged portions of text directly copied from NYT articles. NYT never granted ChatGPT any license to redistribute their content for commercial purposes, and it doesn’t seem like ChatGPT is doing anything covered by fair use (such as providing discussion or commentary).

I’m not aware of any part of copyright law that gives an infringing party a pass just because someone pressured them into infringement.

Your main example is completely wrong.

Websites like The Verge have terms of use one is legally obliged to follow to have permission to view and use their site.

But it's not. It's like Claudine Gay saying that the attackers are just cherry-picking the paragraphs with duplicative language. If they had only looked at the rest, they wouldn't have found plagiarism.

It definitely isn’t a good defense.

This would be like if you were able to say some magic words to goad Google Search into giving you a copy of Avatar: The Way of Water. Even worse, the movie would be a file hosted and distributed by Google directly.

they shouldn't have ever made it into a "product". it wasn't ready. this is still a new technology and a new tool in an unfinished early stage in its development. it is still in testing phase. companies shouldn't be trying to make money off of it yet.

it's an appeal to the common good. Which is persuasive and has always been how I see it. I think everyone here is motivated by greed but there is a huge common good from ChatGPT being able to read news articles. Off the top of my head: fake news detection.

The government gets to violate intellectual property in the interest of national security whenever it deems fit.

I take building the first AGI to be in the same category.

If you don't happen to be "the government", it's not exactly in your power to decide that.

It’s in the government’s power to decide that AGI is worth carving out gaping exceptions for. Even for private companies.

If we were to get AGI in a few years, the ends would absolutely justify the means.

All that's needed is for someone to demonstrate the viability of AGI in those circumstances.

Is all that holds back AGI the volume of data?

If so, how much data is needed?

All that holds back AGI is probably not the volume of data. We're still missing key discoveries.

But giving an LLM loads of data might turn out to have been a necessary condition on the road to developing AGI.

> I take building the first AGI to be in the same category.

That's quite the leap. Some don't even think AGI is possible. And some of those that do, don't think LLMs are on the path.

Even if we assume it is, there is a significant amount of non-copyrighted text available to train with.

The difference being ChatGPT needs text that provides value in the ChatGPT product for the general audience.