Can someone explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely OpenAI problem, or also true of the other LLM makers?
Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. E.g. https://www.reuters.com/legal/litigation/openai-gets-partial...
Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.
In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. E.g. Reddit locking down API access and selling their data to Google.
Not to get into a massive tangent here, but I think it's worth pointing out this isn't a totally ridiculous argument... it's not like you can ask ChatGPT "please read me book X".
Which isn't to say it should be allowed, just that our aging copyright system clearly isn't well suited to this, and we really should revisit it (we should have done that two decades ago, really, when music companies were telling us Napster was theft).
> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?
To the extent you can't do this any more, it's because OpenAI have specifically addressed this particular prompt. The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.
> The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.
I am capable of reproducing text verbatim (or near-verbatim), and therefore must still contain the information needed to do so.
I am trained not to.
In both the organic (me) and artificial (ChatGPT) cases, but for different reasons, I don't think these neural nets reliably contain the information needed to reproduce their content; occasionally doing it is not the same as doing it reliably. I think that is at least interesting, albeit only from a technical and philosophical point of view, because if anything it makes things worse for anyone who likes to write creatively or would otherwise compete with the output of an AI.
Myself, I only remember things after many repeated exposures. ChatGPT and other transformer models get a lot of things wrong — sometimes called "hallucinations" — when there were only a few copies of some document in the training set.
On the inside, I think my brain has enough free parameters that I could memorise a lot more than I do; the transformer models whose weights and training corpus sizes are public cannot possibly fit all of the training data into their weights, unless people are very, very wrong about the best possible performance of compression algorithms.
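To put rough numbers on that compression argument, here is a minimal back-of-the-envelope sketch. Every figure in it (7 billion parameters, 2 bytes per parameter, 1 trillion training tokens, about 4 bytes of text per token) is a made-up illustrative assumption, not the spec of any particular model, and `required_compression_ratio` is just a hypothetical helper name.

```python
def required_compression_ratio(n_params, bytes_per_param, n_tokens, bytes_per_token):
    """How many times smaller the weights are than the raw training text.

    For the weights to hold the whole corpus verbatim, they would have to
    act as a lossless compressor with at least this ratio.
    """
    weight_bytes = n_params * bytes_per_param
    corpus_bytes = n_tokens * bytes_per_token
    return corpus_bytes / weight_bytes


# Illustrative assumptions only: swap in the published numbers for a real model.
ratio = required_compression_ratio(
    n_params=7e9,        # assumed parameter count
    bytes_per_param=2,   # assumed 16-bit weights
    n_tokens=1e12,       # assumed training corpus size in tokens
    bytes_per_token=4,   # assumed average bytes of text per token
)
print(f"weights would need to compress the corpus about {ratio:.0f}x")
```

With these assumed inputs the weights would need to be a lossless compressor of several hundred to one, which is the sense in which the whole corpus "cannot possibly fit". It says nothing about whether individual, heavily repeated documents can be memorised.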
Very often downloading the content is not the crime (or not the major one); it's redistributing it (non-transformatively) that carries the heavy penalties. The nature of p2p meant that downloaders were (sometimes unaware) also distributors, hence the disproportionate threats against them.
Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...) but then again he studied CS, not law. Nor has the FSF attempted AFAIK to file any suits even though they likely would have if it were an open and shut case.
All of the most capable models I use have been clearly trained on the entirety of libgen/z-lib. You know it is the first thing they did, it is like 100TB.
A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.
However, while it isn't fully settled yet, at the moment it does not appear to be the case.
A lot of people have a problem with the selective enforcement of copyright law. Yes, changing it because it has been captured by greedy corporations would be something many would welcome. But currently the problem is that normal folks doing what OpenAI is doing would be crushed (metaphorically) under the current copyright law.
So it is not the case that everyone who has a problem with OpenAI is just using copyright as a big cudgel. Also, OpenAI is making money (well, not making a profit is their issue) from the copyrights of others without compensation. Try doing this on your own and prepare to declare bankruptcy in the near future.
No, that is not an example of a "'normal person' that's doing the same thing OpenAI is". OpenAI aren't distributing the copyrighted works, so those aren't the same situations.
Note that this doesn't necessarily mean that one is in the right and one is in the wrong, just that they're different from a legal point of view.
Aaron Swartz, while an infuriating tragedy, is antithetical to OpenAI's claim to transformation; he literally published documents that were behind a licensed paywall.
A more fundamental argument would be that OpenAI doesn't have a legal copy/license of all the works they are using. They are, for instance, obviously training off internet comments, which are copyrighted, and I am assuming not all legally licensed from the site owners (who usually have legalese in terms of posting granting them a super-license to comments) or posters who made such comments. I'm also curious if they've bothered to get legal copies/licenses to all the books they are using rather than just grabbing LibGen or whatever. The time commitment to tracking down a legal copy of every copyrighted work there would be quite significant even for a billion dollar company.
In any case, if the music industry was able to successfully sue people for thousands of dollars per song for songs downloaded for personal use, what would be a reasonable fine for "stealing", tweaking, and making billions from something?
"When I was a kid, I was praying to a god for bicycle. But then I realized that god doesn't work this way, so I stole a bicycle and prayed to a god for forgiveness." (c)
Basically a heist too big and too fast to react to. Now every impotent lawmaker in the world is afraid to call them what they are, because it would bring down on them the wrath of both other IT corpos and of regular users, who will refuse to part with a toy they now feel entitled to.
Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.
As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.
If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.
If I coax your copyrighted work out of my phone’s keyboard suggestion engine letter by letter, and publish it, it’s still me infringing on your copyright, not Apple.
If I make a copy of your clip art in Illustrator, is Adobe responsible? Etc.
Even if (as I’ve seen argued ad nauseam) a model was trained on copyrighted works from a piracy website, the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.
Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?
> the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.
Someone who just reads the material doesn't infringe. But someone who copies it, or prepares works that are derivative of it (which can happen even if they don't copy a single word or phrase literally), does.
> would I then owe the authors of the books I learned from a fee to apply that knowledge?
Facts can't be copyrighted, so applying the facts you learned is free, but creative works are generally copyrighted. If you write your own book inspired by a book you read, that can be copyright infringement (see The Wind Done Gone). If you use even a tiny fragment of someone else's work in your own, even if not consciously, that can be copyright infringement (see My Sweet Lord).
Right, but the onus of responsibility is on the end user publishing the song or creative work in violation of copyright, not the text editor, word processor, musical notation software, etc., correct?
A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.
> A text prediction tool isn’t a person, the data it is trained on is irrelevant to the copyright infringement perpetrated by the end user. They should perform due diligence to prevent liability.
Huh what? If a program "predicts" some data that is a derivative work of some copyrighted work (that the end user did not input), then ipso facto the tool itself is a derivative work of that copyrighted work, and illegal to distribute without permission. (Does that mean it's also illegal to publish and redistribute the brain of a human who's memorised a copyrighted work? Probably. I don't have a problem with that). How can it possibly be the user's responsibility when the user has never seen the copyrighted work being infringed on, only the software maker has?
And if you say that OpenAI isn't distributing their program but just offering it as a service, then we're back to the original situation: in that case OpenAI is illegally distributing derivative works of copyrighted works without permission. It's not even a YouTube-like situation where some user uploaded the copyrighted work and they're just distributing it; OpenAI added the pirated books themselves.
If the output of a mathematical model trained on an aggregate of knowledge that contains copyrighted material is derivative and infringing, then ipso facto, all works since the inception of copyright are derivative and infringing.
You learned English, math, social studies, science, business, engineering, humanities, from a McGraw Hill textbook? Sorry, all creative works you’ve produced are derivative of your educational materials copyrighted by the authors and publisher.
Those software tools don't generate content the way an LLM does so they aren't particularly relevant.
It's more like if I hire a firm to write a book for me and they produce a derivative work. Both of us have a responsibility to guard against that.
Unfortunately there is no definitive way to tell if something is sufficiently transformative or not. It's going to come down to the subjective opinion of a court.
Copyright law is pretty clear on commissioned work, you are the holder, if your employee violated copyright and you failed to do your due diligence before publication, then you are responsible. If your employee violated copyright and fraudulently presented the work as original to you then you would seek compensation from them.
How is the end user the one doing the infringement though? If I chat with ChatGPT and tell it „give me the first chapter of book XYZ“ and it gives me the text of the first chapter, OpenAI is distributing a copyrighted work without permission.
> As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.
Where this breaks down though is that contributory infringement is still a thing if you offer a service that aids in copyright infringement and you don't do "enough" to stop it.
I.e., it would all be on the end user for folks that self-host or rent hardware and run an LLM or gen-art AI model themselves. But folks that offer a consumer-level end-to-end service like ChatGPT or MidJourney could be on the hook.
Right, strictly speaking, the vast majority of copyright infringement falls under liability tort.
There are cases where infringement by negligence could be argued, but as long as there is a clear effort to prevent copying in the output of the tool, then there is no tort.
If the models are creating copies inadvertently and separately from the end user's deliberate efforts, then yes, the creators of the tool would likely be the responsible party for infringement.
If I ask an LLM for a story about vampires and the model spits out The Twilight Saga, that would be problematic. Nor should the model reproduce the story word for word on demand by the end user. But it seems like neither of these examples is a likely outcome with current models.
The Pirate Bay crew was convicted of aiding copyright infringement. In that case you could not download derivatives from their service. Now you can get verbatim text from the models, text that any traditional publisher would have to pay a license for to print even a reworded copy.
With that said, Creative Commons showed that copyright cannot be fixed; it is broken.
> Can someone explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely OpenAI problem, or also true of the other LLM makers?
Uber showed the way. They initially operated illegally in many cities but moved so quickly as to capture the market and then they would tell the city that they need to be worked with because people love their service.
The short answer is that there are actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since the big generative AI blow-up, fueled by entities with deep pockets (i.e., ones you can actually profit from suing), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.
And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted content for training, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of training for AI being found fair is actually quite slim.
> Can someone explain to me how they can simply not respect copyright and get away with it? Also is this a uniquely OpenAI problem, or also true of the other LLM makers?
"Move fast and break things."[0]
Another way to phrase this is:
Move fast enough while breaking things and regulations
can never catch up.
You'll find people on this forum especially using the false analogy with a human. As if these things are like, or analogous to, human minds, and human minds have fair use access, so why shouldn't these?
Magical thinking that just so happens to make lots of $$. And after all why would you want to get in the way of profit^H^H^Hgress?
It's because copyright is fake and the only thing supporting it was million-dollar businesses. It naturally crumbles when facing billion-dollar businesses.
Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.
I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.