That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.
Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.
It would likely be a lot easier for someone from within the BBC, CBC, PBS, or another public broadcaster to convince their employer to contribute to the models. These organizations often have accessibility mandates with real teeth, and real costs in implementing those mandates. The work of closed captioning, for example, can realistically be improved by excellent open source speech recognition and TTS models without handing all of the power over to YouTube and the like.
It would still be an uphill battle to convince them to hand over the training set, but the legal department can likely be convinced if the data set they contribute back is heavily chopped-up audio of the original content, especially if they have the originals from before mixing. I imagine short audio files without any of the music, sound effects, or visual content are pretty much worthless as far as IP goes.
That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.
I'm not sure that is a clear copyright violation. Sure, at a glance it seems like a derivative work, but it may be altered enough that it is not. I believe that collages and reference guides like CliffsNotes are both legal.
I think a bigger problem would be that the scripts, and even the closed captioning, rarely match the recorded audio 100%.
And also... it's not like the program actually contains a copy of the training data, right? The training data is a tool which is used to build a model.
How is it different from things like GPT-3, which (unless I’m mistaken) is trained on a giant web scrape? I thought they didn’t release the model out of concerns for what people would do with a general prose generator rather than any copyright concerns?
Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that isn't infringing or derivative. But then you have models like GPT that reproduce some fairly large parts of the training set word-for-word, which might be infringing.
Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.
Some say that the Google Books court case is precedent for ML models; if you search back through my comment history you will find links.
I am aware; what I'm asking is whether the model itself is infringing. Surely you can't distribute the copyrighted works in a dataset, but is training on copyrighted data legal, and can you distribute the resulting model?
All text written by a human in the US is automatically copyrighted by its author. So if an engine trained on works under copyright is a derivative work, GPT-3 and friends have serious problems.
I expect that wouldn't be perfect, though. Sometimes the cut that makes it into the final product doesn't exactly match the script. Sometimes it's due to an edit; other times an actor says something similar to, but not exactly, what the script says, and the director decides to just go with it.
What might work better is using closed captions or subtitles, but I've also seen enough cases where those don't exactly match the actual speech either.
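For what it's worth, one rough way to deal with captions that drift from the audio would be to compare each caption against a first-pass transcript of the same clip and drop the clips that score poorly. A minimal sketch in Python, assuming you already have a transcript from some existing recognizer; the function names and the 0.9 threshold are made up for illustration and would need tuning on real data:

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def caption_match_ratio(caption, transcript):
    """Word-level similarity between a caption and a transcript, from 0.0 to 1.0."""
    matcher = SequenceMatcher(None, words(caption), words(transcript), autojunk=False)
    return matcher.ratio()


def keep_clip(caption, transcript, threshold=0.9):
    """Keep a clip only if its caption closely matches what was transcribed.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return caption_match_ratio(caption, transcript) >= threshold


if __name__ == "__main__":
    # Hypothetical caption/transcript pairs, purely for illustration.
    print(keep_clip("Hello, world!", "hello world"))
    print(keep_clip("I'll be back in a minute.", "I will be right back"))
```

This only flags mismatches; it doesn't fix them, and it leans on the first-pass transcript being reasonably good, so it would probably throw away some perfectly usable clips as well.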
Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.