| HN Mirror

> https://huggingface.co/datasets/P1ayer-1/books-3/discussions...

https://transparencyreport.google.com/copyright/overview?hl=...

> It seems incredible to me to suggest that piracy wasn't involved in the collection of training data, regardless of your view on the morality or legality of it. Datasets like books 3 indisputably contained copyrighted content that was being distributed without permission from the rightsholder.

Is the Google search engine piracy?

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com....

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

https://9to5google.com/2016/04/27/getty-images-google-piracy...

https://www.reuters.com/article/idUSN07281154/

> That's just the definition of piracy. If we can't agree on that then I'm not sure what we're doing here.

It literally isn't the definition of piracy.

Piracy exists only with regard to the legal definition: "Copyright infringement (at times referred to as piracy) is the use of works protected by copyright without permission for a usage where such permission is required, thereby infringing certain exclusive rights granted to the copyright holder, such as the right to reproduce, distribute, display or perform the protected work, or to make derivative works."

Even this definition annoys a lot of people, but I will ignore the whole "it's not theft because you're not depriving the original owner of anything" argument as a case of taking an analogy too literally.

> More materially to this discussion, yes, it would absolutely make a difference if the AI was only trained on licensed content. I wouldn't use it but I wouldn't have a problem with it. The issue is specifically that much of the work being used without permission is being used to replace the people who made that work, and is being used without permission. If the model is based on ethically acquired data, it would be less able to reproduce the style of specific artists. Imo, there would be more room for both kinds of art in this case.

Congratulations on being consistent, but even in that case almost all the artists and authors are still permanently out of work.

Even ignoring that style isn't covered by copyright (because you could reasonably argue instead that it's a trademark and/or design right issue), most artists are already extremely poor due to oversupply by other humans.

> I'm also aware that it's not a clear-cut case legally but I think AI advocates and tech enthusiasts think it's a lot more likely that AI will win in court than the actual chances. Napster took years to litigate and was eventually shut down. There's a really good discussion about this on the Decoder podcast between actual lawyers.

FWIW, I know better than to trust my own beliefs[0] about law, as (free) ChatGPT is simultaneously bad at it and yet vastly better at it than I am.

Likewise, I think (but hold the view weakly) that the mere existence of AI, even at the level it was before ChatGPT's first release, is going to force a radical change in the nature of IP laws. Even then these models were too good and too cheap for countries not to allow them, while also breaking a lot of the current assumptions about everything: https://benwheatley.github.io/blog/2022/10/09-19.33.04.html

[0] I really ought to get a T-shirt printed with "Wittgenstein was wrong!"; there are so many different ways I don't accept one of his famous quotes: https://philosophy.stackexchange.com/questions/72280/first-p...