by Xelynega 563 days ago
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data instead of just the content you were most recently sued over (especially if it sets a precedent)?

But all content is DMCA protected. Avoiding copyrighted content means not having any content, since all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.

The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

Yes, that's how every other industry that redistributes content works.

You have to license content you want to use; you can't just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...

In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot is bound by the ToS. The biggest and clearest case is [1]. People have been scraping the internet for as long as the internet has existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.
It's insane to me that people don't agree that you need a license to train your proprietary for-profit model on someone else's work.
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance to the point that it would jeopardize their viability as a company.
I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.
It's not definitely illegal yet.
It's also definitely not not illegal either. Case law is very much tbd.
That might be the point. If your business model is built on reselling something you’ve built on stuff you’ve taken without payment or permission, maybe the business isn’t viable.
I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”. Then you’ve essentially won all of humanity’s output, as you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills output is just going to be a poor summarization of that primary information.

Another factor that helps an old model plus new public-facing data be complete is that other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we’ve always had. For art, there are just too few of us who are actually going out of our way to get familiar with novel art, versus the vast bulk of the world's present-day artistic effort, which goes towards product advertisement, which once again follows certain patterns people have been publishing in psychological journals for decades now.

In a sense we’ve already put out enough data and made enough of our world formulaic that I believe we’ve already set up a perfect singularity in terms of what can be generated for the average person who looks at a screen today. And because of that, I think even a lack of any new training on such content wouldn’t hurt OpenAI at all.

> I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”

I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work around a potential court ruling is unlikely to impress a judge.

I'm not quite familiar with the Google Books project, but isn't this similar? I'm pretty sure Google got away with scanning copyrighted books in 2015 [0]

[0]: https://www.reuters.com/article/technology/google-book-scann...

They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)

But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)

But who knows. Maybe it can be done for more fact-like stuff.

On this point, I'm sure there is more than enough publicly and freely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.
I would actually love it if that was true. It would reduce a lot of legal headaches for sure. But if that was true, why were previous GPT versions not as good at understanding language? I can only conclude that it's because that's not actually true. There's not enough digital public domain material to train an LLM to understand language competently.

Perhaps old texts in physical form, then? It'll cost a lot to digitize that, wouldn't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something.

(A big part of this is also how insanely long copyright lasts (nearly a hundred years!), which keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)

Edit:

Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."

And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision in TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).

> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time.

Attribution at inference level is bound to work more or less the same way as humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.

RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
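
To make the "look up the source when citing" idea concrete, here is a toy sketch of retrieval with attribution. It is not how ChatGPT Search or any real RAG system works; the `Passage` type, the `retrieve` and `cite` helpers, the word-overlap scoring, and the placeholder corpus are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # hypothetical attribution label kept alongside the text
    text: str


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=1):
    """Return up to k passages sharing at least one word with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(p.text)), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    # Highest word overlap first; a real system would use a proper ranker.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def cite(query, corpus):
    """Quote the best-matching passage with its source, or admit there is none."""
    hits = retrieve(query, corpus)
    if not hits:
        return "No source found; any attribution here would be a guess."
    top = hits[0]
    return f'As {top.source} said, "{top.text}"'


# Placeholder corpus: made-up sources and sentences, not real quotations.
CORPUS = [
    Passage("Source A (placeholder)", "Cats sleep most of the day."),
    Passage("Source B (placeholder)", "Copper conducts electricity well."),
]

print(cite("do cats sleep", CORPUS))
print(cite("quantum chromodynamics", CORPUS))
```

The attribution is only as good as the lookup: the citation comes from the stored metadata rather than from anything the model "remembers", which is where the extra retrieval cost comes from.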

That sounds about right. When I ask ChatGPT about "ought implies can" for example, it cites Kant.
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
Being part of a public court record doesn't seem like something that would invalidate copyright.
Re-training can be done, but (and it is not a small but) models already exist and can be used locally, suggesting that the milk was spilled too long ago at this point. Separately, neutering them effectively lowers their value compared to their non-neutered counterparts.
What about bombing? You could always smuggle DMCA content into training sets, hoping for a payout?
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.

Saying "well, someone snuck in some DMCA content" when sharing family photos doesn't suddenly make it legal to share that DMCA-protected content along with your photos...