If having copyright were a prerequisite of training data this would be true.
But in the US this hasn't been tested in the courts yet, and there's reason to think from precedent this legal argument might not hold (https://www.youtube.com/watch?v=G08hY8dSrUY - sorry don't have a written version of this).
I would imagine that if we use a very strict interpretation of copyright, then things like satire, fan fiction, and fan art would be in jeopardy, as would learning as a whole.
Unless there is literally a substantial copy of some particular piece of copyrighted material, it seems to be a massive hurdle to prove that analyzing something is copyright infringement.
Most people in the fanfiction community recognize that it's probably not strictly allowed under copyright. However, the community response has generally been to do it anyway and try to respect the wishes of the author. That's why you won't find Interview with the Vampire fanfiction on the major sites.
If anything, I think it severely hinders the pro-AI argument if fanfiction made by human authors is also bound by copyright.
ETA: I just tested it out and you can totally create Interview with the Vampire fanfiction with Bing Compose. That output is presumably bound by copyright at least as strictly as the work of human authors, and is thus a copyright violation.
> Copyright protection is available to the creators of a range of works including literary, musical, dramatic and artistic works. Recognition of fictional characters as works eligible for copyright protection has come about with the understanding that characters can be separated from the original works they were embodied in and acquire a new life by featuring in subsequent works.
Creating a work using Harry Potter or Darth Vader or Tarzan is copyright infringement. (On Tarzan: "As of 2023, the first ten books, through Tarzan and the Ant Men, are in the public domain worldwide. The later works are still under copyright in the United States.")
As for creating Interview with the Vampire fan fiction with Bing: Bing didn't have any agency. The question of copyright infringement (I believe) should only be applied to entities with the agency to ask (or not ask) for copyright-infringing works.
> if we use a very strict interpretation of copyright, then things like satire ... would be in jeopardy.
Satire, criticism, reviews and journalism are explicitly permitted under fair use.
If I wish to publicly express my disdain or praise for your art, it is necessary that I can show samples, pictures, or photos when I express whatever my deal is.
The difference is that when writing satire it's not strictly necessary to possess the work to do so. You can merely hear of something and make a joke or a fake story. Training data, on the other hand, uses the actual material, not some derivative you gleaned from a thousand overheard conversations.
> So if you delete your image the entire trained data set is invalid because they no longer have license to the copyright?
The portion of the training set might. The actual trained result -- the outcome of a use under the license -- would, at least arguably, not.
Of course, that's also before the whole "training is fair use and doesn't require a license" issue is considered, which if it is correct renders the entire issue moot -- in that case, using anything you have access to for training, irrespective of license, is fine.
Let's say you post an image, and I learn something by viewing it, then you delete the image. Is my memory of your now deleted image wiped along with everything I learned from viewing it?
Unfortunately, computer memory, unlike your memory, is easily wiped. Having the infrastructure in place to make sure that actually happens, on the other hand, seems more like human memory.
How derived data is handled after copyright is revoked is a question that's hard to answer.
I suspect that the data will be deleted from the dataset, and any new models will not contain derivatives from that image.
How legal that is would be expensive to find out. I suspect you'd need to prove that your image had been used, and that its use contradicts the license that was granted. It would take a lot of lawyer and court time to find out. (I'm not a lawyer, so there might already be case history here. I'm just a sysadmin who's looking after datasets.)
Postscript: something something GDPR. There are rules about processed data, but I can't remember the specifics. There are caveats about "reasonable".
But in the US this hasn't been tested in the courts yet, and there's reason to think from precedent this legal argument might not hold (https://www.youtube.com/watch?v=G08hY8dSrUY - sorry don't have a written version of this).
And the lawsuits so far aren't faring well for those who think training should require having copyright (https://www.hollywoodreporter.com/business/business-news/sar...)