| HN Mirror


by d110af5ccf 1811 days ago

> even if training a dataset is fair use, distributing the result is copyright infringement

I would be inclined to agree that the current situation (i.e., reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model does (or even should) be subject to the copyright of the inputs.

Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.

As an example of a similar situation, see https://www.thiswaifudoesnotexist.net/, which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here: training the model, distributing the model itself, and distributing the output of the model.

> I would want my license to make that part clearer.

But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.

1 comment

ghoward 1811 days ago

You said that you're "not so sure that a trained model does (or even should) be subject to the copyright of the inputs."

You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?


d110af5ccf 1811 days ago

I misspoke. (Err, mistyped?) I suspect there will often be a stronger case to be made for the model itself falling under copyright than for what it outputs. It's up to the courts and the legislature in the end though, so who knows.

Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)
