Hacker News
by Mordisquitos 526 days ago
If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.

Educated human beings are not protected by copyright, hence neither should trained AI models. Conversely, if a copyrightable work is produced based on work which itself is copyrighted, the resulting work needs the consent of the original authors of the prior work.

AI models can't have their ©ake and eat it.

2 comments

> If the resulting AI models are protected by copyright that invalidates the claim that AI models being trained on copyrighted materials is fair-use analogous to human beings becoming educated by exposure to copyrighted materials.

No one training (foundation) models makes that fair use argument by analogy; they make arguments that address the specific statutory and case law criteria for fair use (and frequently focus on the transformative character of the use). It's true that the analogy to a learning human is frequently made in internet fora by AI enthusiasts who aren't the people training models on vast scraped datasets. That argument is bunk for a number of reasons, but most critically the fact that a human learning from material isn’t fair use, because a human brain isn’t treated as a fixed medium, so learning in a human brain isn’t legally a copy or derivative work that would violate copyright without the fair use exception, so it's not a use to which fair use analysis even applies, so you can't argue anything is fair use by analogy to that. But it's moot to any argument for hypocrisy by the big model makers, because they aren’t using that argument to start with.

If I take 1000 books and count the distributions of the lengths of the words, and the covariance between the lengths of one word and the next word for each book, and how much this covariance matrix tends to vary across the different books, and other things like this, and publish these summaries, it seems fairly clear to me that this should count as fair use.

(Such a model/statistical-summary, along with a dictionary, could be used to generate nonsensical texts which have similar patterns in terms of just word lengths.)
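To make the kind of summary described above concrete, here is a rough sketch in Python. The function names (`word_lengths`, `length_distribution`, `adjacent_covariance`, `summarize`), the crude regex tokenizer, and the choice to report a single scalar covariance per book (the off-diagonal entry of the 2x2 covariance matrix of consecutive word lengths) are all assumptions made for illustration, not anything specified in the comment.

```python
import re
from collections import Counter
from statistics import mean, pvariance


def word_lengths(text):
    """Return the length of each word, using a crude letters-and-apostrophes tokenizer."""
    return [len(w) for w in re.findall(r"[A-Za-z']+", text)]


def length_distribution(lengths):
    """Map each word length to the fraction of words having that length."""
    counts = Counter(lengths)
    total = len(lengths)
    return {n: c / total for n, c in sorted(counts.items())}


def adjacent_covariance(lengths):
    """Population covariance between each word's length and the next word's length."""
    if len(lengths) < 2:
        raise ValueError("need at least two words to form a pair")
    xs, ys = lengths[:-1], lengths[1:]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def summarize(books):
    """books: dict of title -> full text. Returns per-book stats plus their spread."""
    per_book = {}
    for title, text in books.items():
        lengths = word_lengths(text)
        per_book[title] = {
            "distribution": length_distribution(lengths),
            "adjacent_cov": adjacent_covariance(lengths),
        }
    covs = [stats["adjacent_cov"] for stats in per_book.values()]
    # How much the adjacent-length covariance varies from book to book.
    return {"per_book": per_book, "cov_spread": pvariance(covs)}
```

The output of `summarize` contains only numbers derived from word lengths, none of the original wording, which is the property the argument here leans on.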

Should the resulting work be protected by copyright? I’m not entirely sure…

I guess one thing is, the specific numbers I obtain by doing this are not a consequence of any creative decision making on my part, which I think in some jurisdictions (I don’t remember which) plays a role in whether a work is copyrightable. (I will use “copyrightable” as an abbreviation for “protected by copyright”; I don’t mean to imply a requirement that someone specifically registers for copyright.) (Iirc this makes it so phone books are copyrightable in some jurisdictions but not others?)

The particular choice of statistical analysis does seem like it may involve creative decision making, but that would just be about like, what analysis I describe, and how the numbers I publish are to be interpreted, not what the numbers are? (Analogous to the source code of an ML model, not the parameters.)

Here is another question: suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, but requires a large (expensive) amount of compute to produce, and which also uses a lot of randomness so that the result would be different each time it was done (but suppose also that there isn’t much point doing it multiple times at the same scale, as having two of this kind of data artifact wouldn’t be much more valuable than having one).

Should such data artifacts be protected by copyright or something like it?

Well, if copyright requires creative human decision making, then they wouldn’t be.

It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes (to a point, of course: only as much as is justified by the value that is produced by them being available).

If such data artifacts can always be distributed without restriction, then ones that are publicly available would be public goods, and I guess only ones that are trade secrets would be private goods? It seems to me like having some mechanism to incentivize their creation and being-eventually-freely-distributed would be beneficial?

But maybe copyright isn’t the best way to do that? Idk.

> The particular choice of statistical analysis does seem like it may involve creative decision making

The selection and structuring of the training set may involve sufficient creativity to be copyrightable (as demonstrated by the existence of “compilation copyrights”), even if it is largely or even entirely composed of existing works; the statistical analysis part doesn't have to be the source of the creativity.

> Should the resulting work be protected by copyright? I’m not entirely sure…

This has already been settled hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'? Just compiled lists of facts can't be protected, which is why things like election result companies have to rely on NDAs and not copyright protections to protect their services on election night.

> This has already been settled hasn't it? Don't companies have to introduce 'flaws' in order for data sets to be 'protected'?

No, flaws are generally introduced to make it easier to detect copies; if multiple flawless reference works covering the same data (road maps of the same region, for instance) exist, each is copyrightable without flaws to the extent any would be with flaws, but you can't prove that someone else copied yours without permission if copying any of the others would give the same result. With flaws, you can attribute the source that was copied more easily, but this isn't about being legally protected but about the practicality of enforcing that protection.

> suppose there is a method of producing a data artifact which would be genuinely (and economically) useful, and which does not rely on taking in any copyrighted input, [...] It seems like it would make sense to want it to be economically incentivized to create such data artifacts of higher sizes [...] But maybe copyright isn’t the best way to do that? Idk.

Exactly. It would be patents, not copyright.