Hacker News
by AnthonyMouse 495 days ago
> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

They don't need every copyrighted work, and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images, or to large publishers, or to social media sites whose terms give the site a license to what you post. Then the middlemen would get a vig and the original authors would get peanuts, if anything at all.

But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.

1 comment

Applying copyright law to more and more things, first software and now AI models, is the status quo, and it makes little sense.

What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates the training of these models, if we want them to exist and be used in a legally safe way. This is needed, for example, because copyright law differs from one jurisdiction to the next, but software travels globally.

It would make sense to make all books available for non-commercial, perhaps even commercial, R&D in AI, if society decided that to be beneficial, in the same way that publishers must donate one copy of each new work to a copyright library (the Library of Congress in the US; the Oxford and Cambridge University libraries and the British Library in the UK; the Frankfurt and Leipzig Nationalbibliotheken in Germany; etc.). Just add an extra provision that they also need to send a plain-text copy to the Linguistic Data Consortium (LDC), which manages datasets for NLP. As with fair use, there can be provisions that compensate for that use automatically in the background (in some countries the price of a photocopying machine includes a fee that gets passed on to copyright holders).

Otherwise you'll have one LLM being legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.