by twoodfin 1110 days ago

It actually doesn’t even matter if LLMs reproduce copyrighted data from their training. The issue is that a human copied the data from its source into memory for use in training, and this copy was likely not fair use under cases like MAI Systems.

The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against Andy Warhol’s estate for his copying of photographs of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.

I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.

https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...

2 comments

luma 1110 days ago

How does that align with Google Books scanning libraries full of copyrighted text, offering full reproductions of sections of the work, and then having the Supreme Court declare it all to be fair use? I think that is a far more relevant precedent here: https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....

twoodfin 1110 days ago

The Supreme Court declined to hear the case on appeal, which is a shade different from endorsing the decision after a hearing.

That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.

luma 1110 days ago

OpenAI is also providing access to limited extracts. Google wasn't selling this over an API; they were providing "free" access to it while displaying ads to the user. Would the courts see this manner of monetization as different enough that settled case law wouldn't apply?

twoodfin 1110 days ago

OpenAI isn’t doing anything like what Google was doing with Books. It’s not hard for laymen to see that, and it’s going to be obvious to any judge who hears a case.

Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously be creating a derivative work and outside fair use bounds. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.

Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.

6gvONxR4sf7o 1110 days ago

Google also bought copies of each book, I believe, which makes it another step removed from standard ML practice.

gyudin 1110 days ago

So how is that supposed to work with people sending it legally obtained copyrighted materials for analysis?

twoodfin 1110 days ago

That copy (the “send”) would be evaluated under the same fair use criteria.

“Write a review of this short story: …” – probably fine.

“Rewrite this short story to have a happier ending: …” – probably not.
