by mstolpm 1188 days ago
Stable Diffusion is trained on 2.3 billion images. It would take a human ~220 years to decide whether each image should be included in the training set, assuming a decision takes 1 second and the human works 8 hours a day, every day of the year.
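
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. The function name and the 365-day year are my own assumptions; the image count, the 1-second decision time, and the 8-hour day come from the figures above.

```python
# Back-of-the-envelope check of the "~220 years" estimate.
# Assumptions (hypothetical, for illustration): a 365-day year, no breaks,
# and a constant 1 second per decision.

def review_years(images, seconds_per_decision=1, hours_per_day=8, days_per_year=365):
    """Years one person needs to make a keep/drop decision on every image."""
    working_seconds_per_year = hours_per_day * 3600 * days_per_year
    return images * seconds_per_decision / working_seconds_per_year

years = review_years(2_300_000_000)
print(f"{years:.1f} years")  # roughly 219 years, in line with "~220"
```

Changing the assumptions does not help much: even a full 24-hour day with no sleep still leaves a review time of several decades.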

GPT is even worse in that regard: it was literally trained on the whole web plus other sources. I'd be surprised if a judge followed the argument that the training data was made by creative choice rather than raw crawling power. I'd argue the sheer size of the training data makes it impossible to claim a creative selection process for the data.

2 comments

That’s precisely the funny thing with copyright:

If you create a work where you can clearly tell what the source of your inspiration was, because you stole from another source, it’s a violation of copyright. But if you create a work where you cannot tell what the source of your inspiration was, because you stole from so many different sources, not only is it not a violation of copyright, it’s actually the creation of a new copyrighted work in its own right.

ML is short-circuiting this legal framework, because stealing from thousands of different authors, in a way that makes it impossible to tell the sources apart, can now be done at the press of a button.

It's a thing. Someone wrote it. Raw crawling power was created. Someone made the choice to use raw crawling power. This is all a creative act.

Different models do different things. If there were no creation involved, they would not.