| HN Mirror


by Hizonner 689 days ago

> Which makes it a database or dataset and very much protected by copyright.

Not every collection of numbers is a database, and a database is not the same thing as a dataset.

Databases have limited copyright-like protection in some places. Under TRIPS, that extends only to databases that are "creative by virtue of the selection or arrangement of their contents" or something along those lines. In the US they talk specifically about curation.

ML models do not meet either requirement by any reasonable interpretation.

> The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong.

The "source code" of an ML model is most reasonably interpreted as including all of the training data, which are never, ever available.

Now you know better.

[On edit: By the way, the people creating these works had better hope they're outside copyright, because if not, each one of them is a derivative work of (at least some large and almost impossible to identify subset of) its training data, so they need licenses from all the copyright holders of that training material, which few of them have or can get.]

1 comment

kube-system 689 days ago

If we stop unnecessarily anthropomorphizing software, I think it is plainly obvious these are derivative works. You take the training material, run it through a piece of software, and it produces an output based on that input. Just because the black box in the middle is big and fancy doesn't mean that somehow the output isn't a result of the input.

However, transformativeness is a factor in whether or not there is a fair-use exception for the derivative work. And these models are highly transformative, which is a strong argument that they qualify as fair use.

Hizonner 689 days ago

Maybe, but...

"Fair use" is pretty much entirely a US concept, and similar concepts in other countries aren't isomorphic to it.

The model does have a radically different form from its inputs. So you could easily imagine that being "transformative enough" for US fair use. A lot of the other fair use elements look pretty easy to apply, too. Although there's still the question of whether all the intermediate copies you made to create the model were fair use...

In fact, I'll even concede that a court could find that a model wasn't a derivative work of its inputs to begin with, and not even have to get to the fair use question. The argument would be that the model doesn't actually reproduce any of the creative elements of any particular training input.

I do think a finding like that would be a much bigger stretch than a finding that the model was copyrightable. I could easily see a world where the model was found derivative but was not found copyrightable. And it's actually not clear to me at all that the model has to be copyrightable to infringe the copyright in something else, so that's another mess.

Somewhat related, even if the model itself isn't infringing, it's definitely possible to have most models create outputs that are very similar to (some specific examples in) their training data... in ways that obviously aren't transformative. Outputs that might compete with the original training data and otherwise fail to be fair use. So even if the model is in the clear, users might still have to watch out.