by rasz 925 days ago
That's because the word 'training' is doing all the heavy lifting here. Think of it as copying, compressing and storing all the copyrighted material in a database. Humans learn, humans train, computers encode data. You would never say ffmpeg learned a movie.
3 comments

> You would never say ffmpeg learned a movie.

No, you wouldn't, but these diffusion models do way more than ffmpeg, and do qualitatively different things.

I am on the fence, but I lean towards the side where training an AI using existing works is not infringement, as long as the AI's output is (or can be) majority new works. For example, a poor training algorithm that merely repeats the training dataset (and cannot output new works) is infringing, while a different algorithm (such as the current stable diffusion one) that can output works that have never been made and are totally new does not infringe - after all, style and ideas are not protected by copyright, and if the algorithm managed to extract those ideas from the training set, all the better.

Majority new works is not a good enough standard. If any output is a direct reproduction of a copyrighted input, that output is copyright infringement whether it was intended or not. If the trainer of the model doesn’t want to be sued for infringement, they are responsible for a robust safety mechanism that prevents it. If that safety mechanism isn’t possible, then don’t use copyrighted works if you have any possibility of directly reproducing them.
> If any output is a direct reproduction of a copyrighted input that output is copyright infringement

So by that standard, why isn't Photoshop a copyright infringement? You can use it to create a copy just the same.

Photoshop isn’t a copyright infringement inherently but producing an infringed image with photoshop is still infringement. Much the same way AI is not inherently infringement but any production of infringing content by the AI is still infringement.
What’s the test for “has never been made and is totally new”?

If I look at a photo of Prince and then, using that image as a reference, create a new silkscreen painting, is that fair use or infringement?

I ask because the US Supreme Court has ruled that the instance I referenced was infringement, as both images were used for magazine covers [0].

[0] https://www.nbcnews.com/news/amp/rcna64624

> What’s the test for “has never been made and is totally new”?

The existing copyright rulings are sufficient to determine this, and they have nothing to do with AI models.

You've already pointed out a case - if you use an AI to generate an image which has sufficient likeness to an existing one, then the AI portion is irrelevant to the ruling. You could've made that same image in photoshop without AI, and should obtain the same ruling.

But in the above circumstance, the silkscreen used in the creation of the image does not itself infringe. Replace that silkscreen with an AI model, and nothing has changed.

> Think of it as copying, compressing and storing all the copyrighted material in a database.

But it isn’t. It’s just a series of vectors that point to a likely occurrence of the next word or pixel or bit in a sequence.

You are trying to argue encoding semantics, but at the end of the day the "AI" was completely happy to recite Carmack's fast inverse square root, including the original comments, verbatim, word for word.

https://twitter.com/StefanKarpinski/status/14109710611816816...

With the way these AI models work, that data isn’t stored in a database though.

It’s hard for people to understand this concept, but the fact that a model repeated some data verbatim is a happy coincidence (!) solely based on patterns of data that it has seen before.

I think people also have a hard time with how these models are trained. They are vacuuming up all sorts of data and learning from them by creating vectors that determine how follow-up data should be generated.
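
To make the "patterns, not a database" point concrete, here is a toy sketch in Python. It is not how diffusion models or large language models work; it is a hypothetical count-based next-word predictor (the names `train` and `generate` are made up for illustration). It only records which word tends to follow each two-word context, yet when the training text is small and each context appears once, greedy generation walks back through the training text word for word.

```python
from collections import Counter, defaultdict


def train(tokens, order=2):
    """Count which token follows each `order`-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        table[context][tokens[i + order]] += 1
    return table


def generate(table, prompt, max_tokens):
    """Greedily append the most frequent follower of the current context."""
    out = list(prompt)
    order = len(prompt)
    for _ in range(max_tokens):
        context = tuple(out[-order:])
        if context not in table:
            break  # no statistics for this context, so stop
        out.append(table[context].most_common(1)[0][0])
    return out


text = "the quick brown fox jumps over the lazy dog".split()
table = train(text, order=2)

# Prompt with the first two words; the rest comes from the follower counts.
result = generate(table, text[:2], max_tokens=len(text))
print(" ".join(result))
print(result == text)
```

Whether a table of follower counts that can regenerate its input counts as "storing" that input is exactly what the two sides of this thread disagree about; the sketch only shows that both descriptions can apply to the same object.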

Sure, the original creators of this content aren’t being compensated or even recognized for it. I don’t have a good idea on how that should be handled.

For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime. (Unless you’re sharing the DeCSS source code I guess…)

Slightly changing the topic here, but I do wonder what would happen if someone wrote a program called “Monkeys on Typewriters” that just iterated through various combinations of characters (or bits or pixels) and was able to recreate things verbatim.

Is that random happenstance copyright infringement?
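
For a sense of what such a program would look like, here is a minimal sketch in Python. The function names `enumerate_strings` and `attempts_until` are hypothetical, and the three-letter alphabet is an assumption chosen so the run finishes quickly. It lists every string over the alphabet in order of length and counts how many candidates it emits before hitting a given target.

```python
from itertools import count, product


def enumerate_strings(alphabet):
    """Yield every non-empty string over `alphabet`, shortest first."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def attempts_until(target, alphabet):
    """Return how many candidates are emitted up to and including `target`."""
    if not target or any(ch not in alphabet for ch in target):
        raise ValueError("target must be non-empty and use only alphabet symbols")
    for n, candidate in enumerate(enumerate_strings(alphabet), start=1):
        if candidate == target:
            return n


print(attempts_until("cab", "abc"))
```

The number of candidates grows as the alphabet size raised to the text length, so a 100-character passage over 27 symbols already has 27^100 (roughly 10^143) possibilities of that length alone. In practice the enumeration never reaches any real work unless something that already knows the target picks it out.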

> For normal humans though, looking at art or reading a book, and later repeating some passage or drawing something from your own memory is not a crime.

False, actually; memorizing a copyrighted work and reproducing it other than in conditions specifically excepted from copyright protection is a violation of the exclusive rights of the copyright holder to make copies.

Copyright doesn't just apply to mechanical copies which don't have a human brain in the middle of the process.

Reciting common text or common license elements and commentary isn't necessarily copyright infringement.
You would never say ffmpeg stole a movie either...