Hacker News
by dchichkov 458 days ago
>> In the proposal, OpenAI also said the U.S. needs “a copyright strategy that promotes the freedom to learn” and on “preserving American AI models’ ability to learn from copyrighted material.”

Perhaps also symmetric "freedom to learn" from OpenAI models, with some provisions / naming convention? U.S. labs are limited in this way, while labs in China are not.

7 comments

It still warps my brain, they’ve taken trillions of dollars of industry and made a product worth billions by stealing it. IP is practically the basis of the economy, and these models warp and obfuscate ownership of everything, like a giant reset button on who can hold knowledge. It wouldn’t be legal, or allowed if tech wasn't seen as the growth path of our economy. It’s a hell of a needle to thread and it’s unlikely that anyone will ever again be able to model from data so open.
"IP" is a very new concept in our culture and completely absent in other cultures. It was invented to prevent verbatim reprints of books, but even so, the publishing industry existed for hundreds of years before then. It's been expanded greatly in the past 50 years.

Acting like copyright is some natural law of the universe that LLMs are upending simply because they can learn from written texts is silly.

If you want to argue that it should be radically expanded to the point that not only a work, but even the ideas and knowledge contained in that work should be censored and restricted, fine. But at least have the honesty to admit that this is a radical new expansion for a body of law that has already been radically expanded relatively recently.

> It was invented to prevent verbatim reprints of books

It was also invented to keep the publishing houses under control and keep them from papering the land in anti-crown propaganda (like the stuff that fueled the civil war in England and got Charles I beheaded).

Probably one of the biggest brewing fights will be whether the models are free to tell the truth or whether they'll be mouthpieces for the ruling class. As long as they play ball with the powers that be, I predict copyrights won't be a problem at all for the chosen winners.

That's why I am a big proponent of local, open-weights computation. They can't shut down a non-compliant model if you're the one running it yourself.
I agree this would be a positive direction, but something that gives me pause is the forced upgrades and hardware cycle of both mac and windows now. They both scan files in your system constantly for various reasons, so for this purpose you're really stuck on *nix variants, right?
That's what I do. I'm really sick of the OS no longer being mine.
"mouthpieces for the ruling class"

That's actually a great point. Judging from the current state of media, there is a clear momentum to take sides in moral arguments. Maybe the standard for models needs to be a fair use clause?

> It's been expanded greatly in the past 50 years.

Elephant in the room. If copyright and patent both expired after 20 years or so then I might feel very differently about the system, and by extension about machine learning practices.

It's absurd to me that broad cultural artifacts which we share with our parents' (or even grandparents') generation can be legally owned.

What AI companies are doing (downloading pirated music and training models) is completely unfair. It takes a lot of money (everything related to music is expensive), talent and work to record a good song and what AI companies do is just grab millions of songs for free and call it "fair use". If their developers are so smart and talented why don't they simply compose and record the music by themselves?

> not only a work, but even the ideas and knowledge contained in that work

AI models reproduce existing audio tracks when asked, although in a distorted and low-quality form.

Also it will be funny to observe how the US government will try to ignore copyright violations by AI companies while issuing ridiculous fines to ordinary citizens for torrenting a movie.

Everything in tech is unfair. Music teachers replaced by apps and videos. Audio engineers replaced by apps. Albums manufacturing and music stores replaced by digital downloads. Custom instruments replaced by digital soundboards. Trained vocalists replaced by auto-tune. AI is just the final blip of squeezing humans out of music.
Not just music, models are trained on all types of art forms that have been created by humans across every medium and businesses are now choosing to use content from AI rather than pay an artist.

Breakout success can still be achieved from humans who create brand new art styles that can't yet be replicated by an AI. These artists will reap the rewards until all of these works are added to the subsequent AI training models.

> AI models reproduce existing audio tracks when asked, although in a distorted and low-quality form.

So can my wife. Who should I call to have her taken away?

The RIAA.
> What AI companies are doing (downloading pirated music and training models) is completely unfair.

We work in an industry built on leveraging unfairness. Expecting otherwise on this forum is very odd.

>We work in an industry built on leveraging unfairness. Expecting otherwise on this forum is very odd.

Yet this forum is very quick to criticize other people and other industries for unfairness.

Is it? From my perspective it seems like the folks here mostly are part of the problem, even if there is diversity of opinions.
The problem here is it's still illegal for me to do a backup copy of the stuff I bought, but they can do whatever they want.
“The Venetian Patent Statute of 19 March 1474, established by the Republic of Venice, is usually considered to be the earliest codified patent system in the world.[11][12] It states that patents might be granted for "any new and ingenious device, not previously made", provided it was useful. By and large, these principles still remain the basic principles of current patent laws.“

What are you talking about.

Patents and copyright are very different beasts.
The discussion was about IP though, which includes both of those.
As another commenter says, this is about IP, but even positing that copyright is somehow invalid because it’s new is incredibly obtuse. You know what other law is relatively new? Women’s suffrage.

I’m annoyed by arguments like the above because they’re clearly derived from working backwards from a desired conclusion; in this case, that someone’s original work can be consumed and repurposed to create profit by someone else. Our laws and society have determined this to be illegal; the fact that it would be convenient for OpenAI if it weren’t has no bearing.

Also, a quick glance at the wikipedia page for "copyright" talks about the first law being put down and enforced in 1710. What are we even doing here?
Your argument that IP and copyright do not exist now because they did not exist in the past is bogus.

IP and copyright exist.

You are missing GP's point and misunderstanding what generative models are actually doing.

The late OpenAI researcher and whistleblower, Suchir Balaji, wrote an excellent article regarding this topic:

https://suchir.net/fair_use.html

Is it the same thing though? Even though Lord Of The Rings, the book, likely has been used to train the models you can't reproduce it. Nor can you make a derivative of it. Is it really the same comparison like "Simba the white lion" and "the lion king"?

https://abounaja.com/blog/intellectual-property-disputes

Gearing up for a fight between the two major industries based on exploitative business models:

Copyright cartels (RIAA, MPAA) that monetized young artists without paying them much at all [1], vs the AI megalomaniacs who took all the work for free and used Kenyans at $2 an hour [2] so that they can raise "$7 trillion" for their AI infrastructure

[1] https://www.reddit.com/r/LetsTalkMusic/comments/1fzyr0u/arti...

[2] https://time.com/6247678/openai-chatgpt-kenya-workers/

Can't believe I'm actually rooting for the copyright cartels in this fight.

But that does make me think, that in a sane society with a functional legislature I wouldn't have to pick a dog in this fight. I'd have enough faith in lawmakers and the political process to pursue a path towards copyright reform that reins in abuses from both AI companies and megacorp rightsholders.

Alas, for now I'm hoping that aforementioned megacorps sue OpenAI into a painful lesson.

> Can't believe I'm actually rooting for the copyright cartels in this fight.

The same megacorps are suing Internet Archive for their collection of 78rpm records. These guys would rather see art orphaned and die.

Yup, we live in a pretty depressing world.

More generally the best we can hope for is to discourage concentrated power, both in government and corporate forms.

They're suing Internet Archive because IA scanned a bunch of copyrighted books to put online for free (e: without even attempting to get permission to do so) then refused to take them down when they got a C&D lol. IA is putting the whole project at risk so they can do literal copyright infringement with no consequences.
During covid, when everyone was told to stay at home and not do anything, the library offered library books.

And what they actually did is violate the requirement to have a physical copy of the book they were lending.

As I understand it, they did not offer anything new that wasn't available to loan prior.

I could be wrong. But if I'm not, I see no reason to lambast IA.

It's not lambasting to communicate what happened. IA got a C&D, refused to comply, and got sued for copyright infringement. The courts sided with the publishers when IA tried to claim it was fair use (technologists seem to have a pattern of stretching the definition of fair use). They've put their entire project at risk because they've repeatedly ignored the law here. That's just what happened.
I should have "freedom to learn" about any Tesla in the showroom, any F-35 I see lying around an airbase or the contents of anyone in the government's bank account.
According to this scheme, if you find a bug and can read the bank's data, then you can use it as you want.
Nope, have to feed it into an llm first, afterwards it's legitimate.
No need for an LLM. Humans always have their own neural networks in their heads. :)
Can this extend to every kid sued by the record industry for downloading a few songs?

Have we all been transported to bizarro land?

Different rules for billion dollar corps I guess.

Those cases did very poorly whenever they actually went to court (well at least also including the ones that were summarily dismissed by the courts, meaning they didn't technically make it to court). They were much more of a mafia style shakedown than an actual legal enforcement effort.

Same rules, but people are a lot less inclined to defend themselves because the cost of loss was seen as too high to even risk it.

Chinese AI must implement socialist values by law, but law is a much more fluid fuzzy thing in China than in the USA (although the USA seems to be moving away from rule of law recently).
> Chinese AI must implement socialist values by law

I don't doubt it but am interested to read a source? I know the models can't talk about things like Tiananmen Square 1989, but what does 'implementing socialist values by law' look like?

https://www.cnbc.com/2024/07/18/chinese-regulators-begin-tes...

"Socialist values" is literally the language that China used in announcing this.

Here is a recent article from a Chinese source:

https://www.globaltimes.cn/page/202503/1329537.shtml

Although censorship isn't mentioned specifically, it is definitely 99% of what they are focused on (the other 1% being scams).

China practices Rule by law, not Rule of law, so you know...they'll know it's bad when they see it, so model providers will exercise extreme self censorship (which is already true for social network providers).

> China practices Rule by law, not Rule of law

In practice the US is less different than you imply. For the vast majority of Americans, being sued is a punishment in and of itself due to the prohibitive costs of hiring a lawyer. In the US we have a right to a “speedy” trial but there are many people sitting in jail now because they can’t afford the bail to get out. Speedy could mean months.

I say this because when we constantly fall so far short of our ideals, one begins to question if those are really our ideals.

No one has pure rule of law, but at least the USA has it as a goal. The Chinese government has stated explicitly that rule of law isn’t a goal, so it leads to a very different legal system from ours. You have to think much more deeply about the spirit of the law and the flippant intentions of the official class that has all the power (the judicial system isn’t allowed to check official power, or even interpret ambiguous or competing laws).
> The Chinese government has stated explicitly that rule of law isn’t a goal

Can you share where you saw this? I am also not aware of anywhere that the US has stated that rule of law is a goal. What you are referring to is more of a norm or tradition. And norms can and do change over time for better or worse.

You could argue that rule of law follows from the preamble to the constitution but that doesn’t explicitly mention rule of law either. It mentions various values like justice and tranquility.

Socialism and freedom of speech aren't mutually exclusive
So? US AI must implement US rules by law. AI models are heavily censored and tend to favor certain political viewpoints.
Which political viewpoints do you think that AI models currently favor?
Empirical research has been done that shows current AI models to be left-leaning.

Here is some (non-empirical) displayed data: https://trackingai.org/political-test

Here is some research on that matter: https://arxiv.org/abs/2502.08640 Here is more: https://www.sciencedirect.com/science/article/pii/S016726812...

I like how this "freedom to learn" should apply to models, but not real people..
It already applies to real people, doesn't it? I.e. if you read a book, you're not allowed to start printing and selling copies of that book without permission of the copyright owner, but if you learn something from that book you can use that knowledge, just like a model could.
Can I download a book without paying for it, and print copies of it? Stash copies in my bathroom, the gym, my office, my bedroom etc. to basically have a copy on hand to study from whenever I have some free time?

What about movies and music?

Yes, you're allowed to make personal copies of copyright works that you own. IANAL, but my understanding is that if you're using them for yourself, and you're not prevented from doing so by some sort of EULA or DRM, there's nothing in copyright law preventing you from e.g. photocopying a book and keeping a copy at home, as long as you don't distribute it. The test case here has always been CDs—you're allowed to make copies of CDs you legally own and keep one at home and one in your car.
> Yes, you're allowed to make personal copies of copyright works that you own.

That’s not the point. It’s about books you don’t own. Are you allowed to download books from Z-Library, Sci-Hub etc. because you want to learn?

To the best of my knowledge, no individual has ever been sued or prosecuted specifically for downloading books. As long as you're not massively sharing them with others, it's not an issue in practice. Enjoy your reading and learning.
CDs, software, and electronic media, yes. Physical books, no. You can't make archival copies.
sure you can, you could take a physical book, and painstakingly copy each page at a time, that is totally fair use.
You may copy, but you may not circumvent the copy protection.
I'm moving the goalposts here, since it was not OpenAI (as far as we know): where does Meta training on torrented data fit into this?
That's not a one-to-one analogy. The LLM isn't giving you the book, it's giving you information it learned from the book.

The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?", and under US copyright law, the answer is: Yes.

> The analogous scenario is "Can I read a book and publish a blog post with all the information in that book, in my own words?"

The analogous scenario is actually "Can I read a book that I obtained illegally and face no consequences for obtaining it illegally?" The answer is "Yes" there are no consequences for reading said book, for individuals or machines.

But individuals can face serious consequences for obtaining it illegally. And corporations are trying to argue those consequences shouldn't apply to them.

> But individuals can face serious consequences for obtaining it illegally.

Can they? Who has ever faced serious consequences for pirating books in the US?

There's no analogy because the scale of it takes it to a whole different level and degree, and for all intents and purposes we tend to care about level and degree.

Me taking over control of the lemonade market in my neighbourhood wouldn't ever be a problem to anyone, a very minor annoyance; instead if I managed to corner the lemonade market of a whole continent it'd be a very different thing.

The better analogy is "can my business use illegally downloaded works to save on buying a license". For example, can you use a pirated copy of Windows in your company? Can you use a pirated copy of a book to compute weights of a mathematical model?
> Can I download a book without paying for it, and print copies of it?

No, but you can read a book, learn its contents, and then write and publish your own book to teach the information to others. The operation of an AI is rather closer to that than it is to copyright violation.

"Should" there be protections against AI training? Maybe! But copyright law as it stands is woefully inadequate to the task, and IMHO a lot of people aren't really treating with this. We need a functioning government to write well-considered laws for the benefit of all here. We'll see what we get.

But I can't legally obtain the book to read and learn from without me (or a library) paying for it. Let's start there first.
Yes, but the learning isn't constrained by those laws. If I steal a book and read it, I'm guilty of the crime of theft. You can put me in jail, try me before a jury, fine me, and put me in prison according to whatever laws I broke.

Nothing in my sentence constrains my ability to teach someone else the stuff I learned, though! In fact, the first amendment makes it pretty damn clear that nothing can constrain that freedom.

Also, note that the example is malformed: in almost all these cases, Meta et al. aren't "stealing" anything anyway. They're downloading and reading stuff on the internet that is available for free. If you or I can't be prosecuted for reading a preprint from arXiv.org or whatever, it's a very hard case to make that an AI can.

Again, copyright isn't the tool here. We need better laws.

If you buy it
No, even if I steal it. I can teach you anything I know. Congress shall make no law abridging the freedom of speech, as it were.
Is the book online and accessible to your eyeballs through your open standards client tool, such that you can learn from seeing it?
Let's say Windows is downloadable from Microsoft's website. Can you use it for free in your company to save on buying a license? Is it ok to use illegal copies of works in a business?
Most books aren't. Unless you pay for them.
To the extent that this is how libraries function, yes.

The part of that which doesn't apply is "print copies", at least not complete copies, but libraries often have photocopiers in them for fragments needed for research.

AI models shouldn't do that either, IMO. But unlimited complete copies is the mistake the Internet Archive made, too.

I missed the part where OpenAI got library cards for all the libraries in the world.

Is having a library card a requirement for being hired over there?

I missed the part where we throw away rational logic skills

Have you never been to a public library and read a book while sitting there without checking it out? Clearly, age is a factor here, and us olds are confused by this lack of understanding of how libraries function. I did my entire term paper without ever checking out books from the library. I just showed up with my stack of blank index cards, then left with the necessary info written on them. Did an entire project on tracking stocks by visiting the library and viewing all of the papers for the days in one sitting rather than being a schmuck and tracking it daily. Took me about an hour in one day. No library card required.

Also, a library card is ridiculously cheap even if you did decide to have one.

I don't need a card to read in the library, nor to use the photocopiers there, but it's merely one example anyway. (If it wasn't, you'd only need one library, any of the deposit libraries will do: https://en.wikipedia.org/wiki/Legal_deposit).

You also don't need permission, as a human, to read (and learn from) the internet in general. Machines by standard practice require such permission, hence robots.txt, and OpenAI's GPTBot complies with the robots.txt file and the company gives advice to web operators about how to disallow their bot.
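
For anyone who hasn't looked at how that permission check works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the example.com URL, and the `is_allowed` helper are made up for illustration; only the `GPTBot` user-agent name comes from the comment above. It shows the convention a well-behaved crawler is expected to follow, not how any particular company's crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: block GPTBot, allow everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", url))
```

Note that robots.txt is purely advisory: nothing technically stops a crawler from ignoring it, which is part of why the legal question doesn't go away.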

How AI should be treated, more like a search index, or more like a mind that can learn by reading? Not my call. It's a new thing, and laws can be driven by economics or by moral outrage, and in this case those two driving forces are at odds.

> Can I download a book without paying for it

Yes, you can read books without paying, if that's how it is offered.

And you can photocopy books you own for your own personal use. But again....the analogy is remembering/learning from a book.

owning a copy and learning the information is not the same. you can learn 2+2=4 from a book, but you no longer need that book to get that answer. each year in school, I was issued a book for class, learned from it, returned the book. I did not return the learning.

musicians can read the sheet music and memorize how to play it, and no longer need the music. they still have the information.

But you still need to buy the sheet music first, all the AI Labs used pirated materials to learn from.

There's two angles to the lawsuits that are getting confused - the largest one from the book publishers (Sarah Silverman et al) attacked from the angle that the models could reproduce copyrighted information. This was pretty easily quelled / RLHF'd out (used to be that if ChatGPT started producing lyrics a supervisor/censor would just cut off its response early - tried it now and ChatGPT.com is now more eloquent, "Sorry, I can't provide the full lyrics to "Strawberry Fields Forever" as they are copyrighted. However, I can summarize the song or discuss its themes, meaning, and history if you're interested!")

But there's also the angle of "why does OpenAI have Sarah Silverman's book on their hard drive if they never paid her for it?" This is the lawsuit against Meta regarding books3 and torrenting, seems like they're getting away with the "we never redistributed/seeded!" but it's unclear to me why this is a defense against copyright infringement.

Not only would the musician have to buy the sheet music first, but if they were going to perform that piece for profit at an event or on an album they'd need a license of some sort.

This whole mess seems to be another case of "if I can dance around the law fast enough, big enough, and with enough grey areas then I can get away with it".

I was handed sheet music every year in band, and within a few weeks had it memorized. Books with music are also available in the library.
>Can I download a book without paying for it

if you have evidence that openAI is doing this with books that are not freely available, i'm sure the publishers would absolutely love to hear about it.

We know Meta has done it. These companies have torrented or downloaded books that they did not pay for. Things like The Pile, LibGen, and Anna's Archive were scraped to build these models.
>if you have evidence that openAI is doing this with books that are not freely available, i'm sure the publishers would absolutely love to hear about it.

Lol, so why are OpenAI challenging these laws?

Do you think OpenAI used fewer sources than Meta?
when it comes to real people, they get sued into oblivion for downloading copyrighted content, even for the purpose of learning. but when facebook & openai do it, at a much larger scale, suddenly the laws must be changed.
Swartz wasn’t “downloading copyrighted content…for the purpose of learning,” he was downloading with the intent to distribute. That doesn’t justify how he was treated. But it’s not analogous to the limited argument for LLMs that don’t regurgitate the copyrighted content.
It does apply to people? When you read a copy of a book, you can't be sued for making a copy of the book in the synapses of your brain.

Now, if you have eidetic memory and write out large chunks of the book from memory and publish them, that's what you could be sued for.

This is not about memory or training. The LLM training process is not being run on books streamed directly off the internet or from real-time footage of a book.

What these companies are doing is:

1. Obtain a free copy of a work in some way.

2. Store this copy in a format that's amenable to training.

3. Train their models on the stored copy, months or years after step 1 happened.

The illegal part happens in steps 1 and/or 2. Step 3 is perhaps debatable - maybe it's fair to argue that the model is learning in the same sense as a human reading a book, so the model is perhaps not illegally created.

But the training set that the company is storing is full of illegally obtained or at least illegally copied works.

What they're doing before the training step is exactly like building a library by going with a portable copier into bookshops and creating copies of every book in that bookshop.

But making copies for yourself, without distributing them, is different than making copies for others. Google is downloading copyrighted content from everywhere online, but they don't redistribute their scraped content.

Even web browsing implies making copies of copyrighted pages, we can't tell the copyright status of a page without loading it, at which point a copy has been made in memory.

> When you read a copy of a book

They're not talking about reading a book FFS. You absolutely can be sued for illegally obtaining a copy of the book.

> when it comes to real people, they get sued into oblivion for downloading copyrighted content, even for the purpose of learning.

Really? Or do they get sued for sharing as in republishing without transformation? Arguably a URL providing copyrighted content, is you offering a xerox machine.

It seems most "sued into oblivion" are the reshare problem, not the get one for myself problem.

This is why I think my array of hard drives full of movies isn't piracy. My server just learned about those movies and can tell me about them, is all. Just like a person!
These AI models are just obviously new things. They aren’t people, so any analogy about learning from the training material and selling your new skills is off base.

On the other hand, they aren’t just a copy of the training content, and whether the process that creates the weights is sufficiently transformative as to create a new work is… what’s up for debate, right?

Anyway I wish people would stop making these analogies. There isn’t a law covering AI models yet. It is a big industry at this point, and the lack of clarity seems like something we’d expect everybody (legislators and industry) to want to rectify.

Model cannot "learn" because it is not a human. What happens is a human obtains "a free copy" of a copyrighted work, processes it using a machine and sells the result.
> Model cannot "learn" because it is not a human.

Sure, that’s why I don’t like the analogy.

> What happens is a human obtains "a free copy" of a copyrighted work, processes it using a machine and sells the result.

Right, so for example it is pretty common to snip up small bits of songs and use them in other songs (sampling). Maybe that could be an example of somewhere to start? But, these ML models seem quite different, I guess because the “samples” are much smaller and usually not individually identifiable. And really the model encodes information about trends in the sources… I dunno. I still think we need a new law.

Totally agree. Except the current administration probably will interpret things the way they see fit ...
> just like a model could

Not really. You can't multiply yourself a million times to produce content at an industrial scale.

Can I pirate books to train myself?
If models can learn for free, then the models (training code, inference code, training data, weights) should also be free. No copyright for anybody.

And if you sell the outputs of your model that you trained on free content, you shouldn't be able to hide behind trade secret.

> just like a model could

It is not remotely the same, the companies training the models are stealing the content from the internet and then profiting from it when they charge for the use of those models.

> the companies training the models are stealing the content from the internet

Are you stealing a billboard when you see and remember it?

The notion that consuming the web is "stealing" needs to stop.

The question is whether it destroys the incentive to produce the work. That is the entire point of copyright and patent law.

LLMs do indeed significantly reduce the incentive to produce original work.

Are you stealing when using pirated software to run a billion-dollar business?
We are not talking about billboards here, we are talking about copyrighted works, like books. If you want to do mental gymnastics and call "consuming" the web the act of downloading books without paying for them, then go ahead, but don't pretend the rest will buy your delusion.
On the contrary, even telling people which billboards are posted about what, and how to get to them to look at them, is "how it works".

But the courts will get to clarify (in today's news):

https://www.reuters.com/legal/news-corp-sued-by-brave-softwa...

The more literature I consume, and the more I re-draft my own attempt, the more I see the patterns and tropes with everyone standing on the shoulders of those who came before.

The general concept of "warp drive" was introduced by John W. Campbell in 1957, "Islands of Space". Popularised by Trek, turned into maths by Alcubierre. Islands of Space feels like it took inspiration from both H G Wells (needing to explain why the War of the Worlds' ending was implausible) and Jules Verne (gang of gentlemen have call-to-action, encounter difficulties that would crush them like a bug and are not merely fine, they go on to further great adventure and reward).

Terry Pratchett had obvious inspirations from Shakespeare, Ringworld, Faust (in the title!).

In the pandemic I read "The Deathworlders" (web fic, not the book series of similar name), and by the time I'd read too many shark jumps to continue, I had spotted many obvious inspirations besides just the one that gave the name.

If I studied medieval lit, I could probably do the same with Shakespeare's inspiration.

And when I "learn" a verbatim copy of pages of that book, then write those pages out in Microsoft Word & sell those pages it's legal?
It doesn't, a real person can't legally obtain a copy of a copyrighted work without paying the copyright holder for it. This is what OpenAI is asking for: they don't want to pay for a single copy of a single book, and still they want to train their models on every single book in history (and song, and movie, and painting, and code base, and anything else they can get their hands on).
Do you know Numerical Recipes in C?

This discussion reminds me of it.

>you can use that knowledge,

Did OpenAI buy one copy of each book, or did they legally borrow the books and documents?

If you copy-paste from books and claim it is your content, you are plagiarizing. LLMs were proven to copy-paste training content, so now what? Should only Big Tech be exempt from plagiarism rules?

I would assume that the request is for it to apply to models in the way that it currently applies to humans.

If a human buys a movie, he can watch it and learn about its contents, and then talk about those contents, and he can create a similar movie with a similar theme.

If OpenAI buys a movie and shows it to their model, it's unclear whether the model can talk about the contents of the movie and create a similar movie with a similar theme.

Is OpenAI buying the movie, or just taking it?

Since "buying" a movie (as it currently applies to humans) is just buying a limited license to it for private viewing, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

Or OpenAI could buy movies the way Disney does, by buying the actual copyright to the film.

> Since "buying" a movie is just buying a license to it, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

That's exactly what already happens currently. Buying a movie on DVD doesn't give you the right to present it to hundreds of people. You need to pay for a public performance license or commercial license. This is why a TV network or movie theatre can't just buy a DVD at Walmart and then show the movie as often as it likes.

Copyright doesn't just grant exclusive distribution rights. It grants exclusive use rights as well, and permits the owner to control how their work is used. Since AI rights are not granted by any existing licenses, and license terms generally reserve any rights not explicitly specified, feeding copyrighted works into an AI data model is a reserved right of the owner.

>Since "buying" a movie (as it currently applies to humans) is just buying a limited license to it for private viewing, can't the copyright holder opt to limit the $4.99 license terms to human viewing, and charge $4999 for an AI training license?

the Reddit data licensing model

somehow, I suspect openai didn't "buy" all of the articles, books, websites they crawled and torrented.
OpenAI didn't pay for most of the content it used.
This is basically "allow us to steal others' IP". It's hard not to treat Altman like a common thief.
Even more so, it only applies to initial model training by companies like OpenAI, not to other companies using those models to generate synthetic data to train their own models.
Not only that

The model gets to use training data of all humans.

But if you use the model as training data OAI will say you’re infringing T&Cs

Yeah it’s crazy. I also suspect they might not be confident in their defense from the NYT lawsuit - if they’re found at fault then it’s going to be trouble.
It is hard to see how a court could decide that copyright does not apply to training LLMs without completely collapsing the entire legal structure for intellectual property.

Conceptually, AI basically zeros out existing IP, and makes the AI the only IP that has any value. It is hard to imagine large rights holders and courts accepting that.

The likely outcome is that courts rule against LLM creators/providers and they eventually have to settle on licensing fees with large corporate copyright holders similar to YouTube. Unlike YouTube though, this would open up LLM companies to class action lawsuits from the general public, and so it could be a much worse outcome for them.

Are there certain books that federal law prevents you from reading? Which ones?

Maybe terrorist manuals and some child pornography, but what else?

They meant "freedom to learn [through backpropagation]" probably.

Companies like this were allowed to siphon the free work of billions of people over centuries and they still want more.