| HN Mirror



	by kube-system 689 days ago
	> mechanically applying that software to datasets that are (a) assembled with minimal, if any creativity, and (b) definitely not assembled with any eye to the specific form of the resulting model.

	Fair enough, but those datasets are also primarily copyrighted material. If the software here merely transforms the input material (which I agree it does), then the output is a derivative work.


Hizonner 689 days ago

But the pitch-shifted song is still recognizably a creative work. It has identifiable, humanly comprehensible forms of all the original creative elements that Swift originally put into it (plus I guess a de minimis amount of extra creativity from the choice to pitch shift it).

If I take a string of data from a true hardware RNG, XOR it with a Taylor Swift song, and throw away the original random stream, is the resulting fundamentally random bit string still a derivative work of the song? As with the ML model, you can't recognize the song in it. And as with at least some training examples in the inputs of most ML models, you can't recover the song from it either.
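A minimal sketch of that XOR thought experiment, under stated assumptions: `secrets.token_bytes` (an OS-level CSPRNG) stands in for the true hardware RNG, and `song`, `other`, and `xor_bytes` are hypothetical placeholders, not real audio or any particular library's API. It only illustrates the recoverability point, not anything about the legal question.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical placeholder for the song's bytes.
song = b"stand-in bytes for an audio file"

# Assumption: an OS CSPRNG stands in for the hardware RNG.
pad = secrets.token_bytes(len(song))
masked = xor_bytes(song, pad)

# While the pad still exists, the song is trivially recoverable.
assert xor_bytes(masked, pad) == song

# Once the pad is thrown away, `masked` fits any same-length input equally
# well: for any other byte string there is a pad that "decrypts" to it.
other = b"some other recording".ljust(len(song), b".")
fake_pad = xor_bytes(masked, other)
assert xor_bytes(masked, fake_pad) == other
```

The last two lines are the part that matters: without the pad, nothing in `masked` singles out the song over any other input of the same length.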

It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?

I do recognize the possibility that the models might not themselves be eligible for copyright as independent works, yet still infringe copyright in the training inputs. It seems messy, but not impossible.

... and as I said elsewhere, it's also messy that while you generally can't recover every training input from the model, you can usually recover something very close to some of the training inputs.


astrange 689 days ago

> If I take a string of data from a true hardware RNG, XOR it with a Taylor Swift song, and throw away the original random stream, is the resulting fundamentally random bit string still a derivative work of the song? As with the ML model, you can't recognize the song in it. And as with at least some training examples in the inputs of most ML models, you can't recover the song from it either.

It's not a copy of it, and when you distribute it you're not distributing the original. So it's not a derivative for copyright purposes.

It can still be a derivative for other legal purposes. Judges don't appreciate it when you do funny math tricks like that and will see through them.

> It feels like the test for whether X is derivative for copyright purposes should include some kind of attention to whether X is a creative work at all. Maybe not, but then what test do you use?

Yes, that's how US copyright law works. (well sort of…)

The more transformative a work is, the less it counts as a copy of the original, since it either falls under fair use exemptions or is clearly a different category of thing.

If a model was a derivative of its training data, then Google snippets/thumbnails would be derivatives of its search results and would be illegal too. Unless you wrote a new law to specifically allow them.

In other countries (Germany, Japan) fair use is weaker, but model training has laws specifically making it legal in certain circumstances, and presumably so do Google snippets.


Hizonner 689 days ago

> It's not a copy of it, and when you distribute it you're not distributing the original.

A compressed (or normally encrypted) version wouldn't be a copy that way, either, but I would still absolutely go down for distributing it. The difference is that the compression can be reversed to recover the original. Even lossy compression would create such a close derivative that probably nobody would even bother to make the distinction.
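To make the reversibility point concrete, here is a rough sketch using Python's standard `zlib` module. The `original` bytes are a hypothetical placeholder, and `quantize` is a deliberately crude stand-in for lossy compression (it just drops the low 4 bits of each byte), not how any real audio codec works.

```python
import zlib

# Hypothetical placeholder for a file's bytes.
original = bytes(range(256)) * 4

# Lossless: the compressed bytes differ from the original,
# but decompression recovers it exactly.
packed = zlib.compress(original)
assert packed != original
assert zlib.decompress(packed) == original


def quantize(data: bytes) -> bytes:
    """Crude lossy stand-in: keep only the high 4 bits of each byte."""
    return bytes(b & 0xF0 for b in data)


# Lossy: the original is not exactly recoverable,
# but every byte stays close to its original value.
lossy = quantize(original)
max_error = max(abs(a - b) for a, b in zip(original, lossy))
assert lossy != original
assert max_error <= 15
```

Either way the original, or something very near it, comes back out, which is the contrast with the XOR case where the random stream has been discarded.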

You're right that "math games" don't work in the law, but that cuts both ways. If you do something that truly makes the original unrecoverable and in fact undetectable, and if nothing salient to the legal issues at hand about the new version derives from the original, then judges are going to "see through" the "math trick" of pretending that it is a derivative.

> then Google snippets/thumbnails would be derivatives of its search results

Thumbnails are legally derivative works, in the US and probably most other places. In the US, they're protected by the fair use defense, and in other places they're protected by whatever carveouts those places have. But that doesn't mean they're not derivative works.

In fact, if I remember the US "taxonomy" correctly, thumbnails are infringing. It's just that certain kinds of infringement are accepted because they're fair use.

If thumbnails weren't derivative works at all, then the question of fair use wouldn't arise, because there can be no infringement to begin with if the putatively infringing work isn't either derivative or a direct copy.

Where thumbnails are different from ML models is that they're clearly works of authorship. In a thumbnail, you can directly see many of the elements that the author put into the original image it's derived from.

The questions are (a) whether ML models are works of authorship to begin with (I say they're not), and (b) whether something that's not a work of authorship can still be a derivative work for purposes of copyright infringement (I'm not sure about that).

So far as I know, neither one is the subject of either explicit legislation or definitive precedent in most of the world, including the US.
