

	by jfarina 1855 days ago
	I wonder if they use movies and TV: recordings where the script is already available.

2 comments

wongarsu 1855 days ago

That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.

Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.

akiselev 1855 days ago

It would likely be a lot easier for someone from within the BBC, CBC, PBS, or another public broadcaster to convince their employer to contribute to the models. These organizations often have accessibility mandates with real teeth and real costs in implementing them. The work of closed captioning, for example, can realistically be improved by excellent open source speech recognition and TTS models without handing all of the power over to YouTube and the like.

It would still be an uphill battle to convince them to hand over the training set but the legal department can likely be convinced if the data set they contribute back is heavily chopped up audio of the original content, especially if they have the originals before mixing. I imagine short audio files without any of the music, sound effects, or visual content are pretty much worthless as far as IP goes.

dec0dedab0de 1855 days ago

> That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.

I'm not sure that is a clear copyright violation. Sure, at a glance it seems like a derivative work, but it may be altered enough that it is not. I believe that collages and reference guides like CliffsNotes are both legal.

I think a bigger problem would be that the scripts, and even the closed captioning, rarely match the recorded audio 100%.

Wowfunhappy 1855 days ago

And also... it's not like the program actually contains a copy of the training data, right? The training data is a tool which is used to build a model.

taneq 1855 days ago

How is it different from things like GPT3 which (unless I’m mistaken) is trained on a giant web scrape? I thought they didn’t release the model out of concerns for what people would do with a general prose generator rather than any copyright concerns?

sodality2 1855 days ago

Does using copyrighted works to train a machine learning model make that model infringing?

wongarsu 1855 days ago

Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that isn't infringing or derivative. But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.

visarga 1855 days ago

> But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Easy fix - keep a bloom filter of hashed ngrams ensuring you don't repeat more than N words from the training set.
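A minimal sketch of what that could look like, assuming whitespace tokenization and a small hand-rolled Bloom filter (the class, the function names, and the sizing below are hypothetical, not taken from any real GPT pipeline). To forbid repeating more than N consecutive words, it stores every (N+1)-word window of the training text and flags any candidate output that contains one of them. A Bloom filter never misses a stored n-gram, but it can occasionally flag text that was never in the training set.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sizing, not tuned)."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def ngrams(words, n):
    """Yield every run of n consecutive words as one space-joined string."""
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def build_filter(training_texts, n):
    """Add every n-gram of every training text to a fresh Bloom filter."""
    bf = BloomFilter()
    for text in training_texts:
        for gram in ngrams(text.split(), n):
            bf.add(gram)
    return bf


def repeats_training_text(candidate, bf, n):
    """True if any n-word window of the candidate is (probably) in the training set."""
    return any(gram in bf for gram in ngrams(candidate.split(), n))


# Allow at most 3 repeated words, so store and check 4-grams.
training = ["the quick brown fox jumps over the lazy dog"]
bf = build_filter(training, 4)
copied = repeats_training_text("he saw the quick brown fox jumps today", bf, 4)
```

A generator would run a check like this on each candidate continuation and resample when it fires. The filter's size has to grow with the number of distinct n-grams in the training set to keep the false-positive rate low.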

pabs3 1855 days ago

Some say that the Google Books court case is precedent for ML model stuff. If you search back through my comment history you will find links.

sodality2 1855 days ago

Thanks!

marcodiego 1855 days ago

GP is not talking about the model but about the training data set.

sodality2 1855 days ago

I am aware. I'm asking whether the model itself is infringing. Surely you can't distribute the copyrighted works in a dataset, but is training on copyrighted data legal, and can you distribute the resulting model?

_jal 1855 days ago

All text written by a human in the US is automatically copyrighted by the author. So if an engine trained on works under copyright is a derivative work, GPT3 and friends have serious problems.

kelnos 1855 days ago

I expect that wouldn't be perfect, though. Sometimes the cut that makes it into the final product doesn't exactly match the script. Sometimes it's due to an edit, other times it's due to an actor saying something similar to but not exactly what the script says, but the director deciding to just go with it.

What might work better is using closed captions or subtitles, but I've also seen enough cases where those don't exactly match the actual speech either.

taneq 1855 days ago

They might work even better for interpreting the intent of spoken text. Not great for dictation though.

habibur 1855 days ago

He meant subtitles when he talked about the script.