Hacker News
by echelon 490 days ago
If the copyright holders win, the model giants will just license.

This effectively kills open source, which can't afford to license and won't be able to sublicense training data.

This is very bad for democratized access to and development of AI.

The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.

If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.

5 comments

License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

To the contrary, this just means companies can't make money from these models.

Those using models for research and personal use wouldn't be infringing under the fair use tests.

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Maybe the strategy is something like this:

1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.

2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models.

The biggest AI companies could even run the enforcement cartels à la BMI/ASCAP to compute and collect royalties owed.

If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything, and would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry) on the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whoever is on the New York Times best seller list at any given moment.

> the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whoever is on the New York Times best seller list at any given moment.

the long tail exists, and there will always be a threshold for payments due to rights holders.

it used to be (like 10 years ago so i might not remember the details exactly) that if you earned less than £1 from youtube performing music rights in a quarter then any money you earned was put back into the pot and redistributed to those earning over £1.

it just wasn’t worth the cost to keep track of £0.00001 earnings for all the rights holders at the bottom of the long tail each quarter, or to pay the bank fees when they eventually earn £0.01 that can be paid to them.

definitely not perfect, but at least some people were getting paid, instead of none.
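A minimal sketch of the threshold rule described above, for illustration only. The function name, the pro-rata split of the pot, and the treatment of earnings of exactly £1 are assumptions; the comment only says that sub-£1 earnings went back into the pot and were redistributed to those above the threshold.

```python
from decimal import Decimal


def redistribute_below_threshold(earnings, threshold=Decimal("1.00")):
    """Pool quarterly earnings under `threshold` and share the pool among
    the rights holders at or above it.

    Assumption: the pool is split pro rata by each paid holder's own
    earnings. The actual split used by the collecting society is not
    stated in the thread.
    """
    # Holders who clear the threshold get paid this quarter.
    paid = {name: amount for name, amount in earnings.items() if amount >= threshold}

    # Everything below the threshold goes back into the pot.
    pool = sum(
        (amount for amount in earnings.values() if amount < threshold),
        Decimal("0"),
    )

    total_paid = sum(paid.values(), Decimal("0"))
    if total_paid == 0:
        # Nobody qualifies, so there is no one to redistribute to.
        return {}, pool

    payouts = {
        name: amount + pool * amount / total_paid
        for name, amount in paid.items()
    }
    return payouts, Decimal("0")


quarter = {"a": Decimal("3.00"), "b": Decimal("1.00"), "c": Decimal("0.40")}
payouts, undistributed = redistribute_below_threshold(quarter)
# "c" falls under the threshold; its 0.40 is shared 3:1 between "a" and "b".
```

The sketch ignores rounding to whole pence and bank fees, both of which a real distribution would have to handle.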

also, youtube’s data they gave us was fairly shit (video title, url). so that didn’t help. nor did the lack of compute/data proc infrastructure/skills. was historically a manual spreadsheet job trying to work out who to cut.

i had to do it a few times :/

edit —

> The biggest AI companies could even run the enforcement cartels à la BMI/ASCAP to compute and collect royalties owed.

what could happen, for music at least, is the same thing that happened with youtube, mashed up with live music analogies.

a licensing negotiation with BMI/ASCAP/PRS, and maybe major publishers directly if they get frustrated with the PROs. then PROs will use sampling of other revenue streams to work out what the likely popular things are for AI. then divvy up whatever the lump sum is between the most popular songs.
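A rough sketch of that kind of sampled distribution, under stated assumptions: the sample is a plain list of song identifiers observed in some other revenue stream, only the `top_n` most-sampled songs share the lump sum, and the split is proportional to sample counts. The names `split_lump_sum`, `sampled_plays` and `top_n` are hypothetical, and none of this reflects how any PRO actually computes distributions.

```python
from collections import Counter


def split_lump_sum(sampled_plays, lump_sum, top_n):
    """Split `lump_sum` among the `top_n` most-sampled songs,
    proportionally to how often each appears in the sample."""
    counts = Counter(sampled_plays)

    # Keep only the most popular songs in the sample (assumed cutoff).
    top = counts.most_common(top_n)

    total = sum(count for _, count in top)
    if total == 0:
        return {}

    return {song: lump_sum * count / total for song, count in top}


sample = ["song_a"] * 6 + ["song_b"] * 3 + ["song_c"]
shares = split_lump_sum(sample, lump_sum=900, top_n=2)
# song_c is in the sample but below the cutoff, so it receives nothing.
```

Anything outside the sample, or below the cutoff, gets nothing, which is the long-tail problem described earlier in this comment.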

we used to do this for live music. i had to generate the sampled dataset in microsoft access each year and weed out all the radio stings.

sorry for costing you a million pounds that one year ed sheeran :/

> figuring out what IP went into what model output is just too hard

Check out this one cool trick companies found for skirting copyright restrictions.

Lawyers HATE them!

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

They don't need every copyrighted work, and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images, or to large publishers, or to social media sites whose terms give the site a license to what you post. Then the middlemen would get a vig and the original authors would get peanuts, if anything at all.

But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.

Applying copyright law more and more to things like software, and now to AI models (in other words, the status quo), makes little sense.

What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates training of these models if we want them to exist and be used in a legally safe way. This is needed for example because most jurisdictions have different copyright laws from one another, but software travels globally.

It would make sense to make all books available for non-commercial, perhaps even commercial, R&D in AI, if society judged that to be beneficial, in the same way that publishers must donate one copy of each new work to a copyright library (the Library of Congress in the US; the Oxford and Cambridge University libraries and the British Library in the UK; the German National Library in Frankfurt and Leipzig; etc.). Just add extra provisions that they need to send a plain text copy to the Linguistic Data Consortium (LDC), which manages datasets for NLP. As with fair use, there can be provisions to compensate for that use that happen automatically in the background (in some countries the price of a photocopying machine includes a fee that gets passed on to copyright holders).

Otherwise you'll have one LLM being legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.

They didn’t train it on every available copyrighted work though, but on a specific set of legal questions and answers. And they did try to license them, and only did the workaround after not getting a license.

I think they were talking about the "model giants" like OpenAI you mentioned. Not saying they're correct, but I will concede the amount of copyrighted information someone like OpenAI would want is probably (at least) an order of magnitude more than this particular case.

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Oh no. Anyway.

Open source model builders are no more entitled to rip off content owners than anyone else. I couldn't possibly care any less if this impacts "democratized access" to bullshit generators. At least if the big boys license the content then the rightful owners get paid (and have the option to opt out).

The copyright lobby has really done a number on public policy. Copyright was never meant to be perpetual.

I’m good with your proposal if we also revert to the original 14 year + 14 year extension model. As it stands the 120 year copyright is so ridiculously tilted that we should not allow it to extend to veto power over technical advancements.

Legal arbitrage isn't a technical advancement. The technical advancement was all the stuff that goes into LLMs, not the part where we feed ever more copyrighted material into models for AICorp to make money.

I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.

Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.

> If the copyright holders win, the model giants will just license.

No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.

If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.

"The biggest models want to train on literally every piece of human-written text ever written"

They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.

Even restricted to that narrower definition, the major commercial model companies wouldn't be able to afford to license all their high-quality human text.

OpenAI is Uber with a slightly less ethically despicable CEO.

It knows it's flouting the spirit of copyright law -- it's just hoping it can bootstrap quickly enough to make the question irrelevant.

If every commercial AI company that couldn't prove training data provenance tomorrow was bankrupted, I wouldn't shed an ethical tear. Live by the sword, die by the sword.

Bold idea, requiring startups to proactively prove they have not broken the law. Should we apply it to all tech startups? Let’s see silicon startups prove they have not stolen trade secrets!

Open source models can crowdsource open source training data. This was done for RNNoise for example.