by enord 887 days ago
I’m completely flabbergasted by the number of comments implying copyright concepts such as “fair use” or “derivative work” apply to trained ML models. Copyright is for _people_, as are the entailing rights, responsibilities and exemptions. This has gone far beyond anthropomorphising and we need to like get it together, man!

You act like computers and ML models aren't just tools used by people.
What did I write to give you that impression?
My initial interpretation was that you're saying fair use is irrelevant to the situation because machine learning models aren't themselves legal persons. But fair use doesn't apply solely to manual creation: use of traditional algorithms (e.g., the snippets, caching, and thumbnailing done by search engines) is still covered by fair use. To my understanding, that's why ronsor pointed out that ML models are tools used by people (and those people can give a fair use defense).

Possibly you instead meant that fair use is relevant, but people are wording remarks in a way that suggests the model itself is giving a fair use defense to copyright infringement, rather than the persons training or using it?

Well then, I could have been much clearer, because I meant something like the latter.

An ML model can neither have nor be in breach of copyright, so any discussion about how it works, and how that relates to how people work or “learn”, is beside the point.

What actually matters is firstly the details of how source material is collated, and later the particular legal details surrounding attribution. The last part involves breaking new ground legally speaking, and IANAL, so I will reserve judgement. The first part, collation of source material for training, is emphatically not unexplored legal or moral territory. People are acting like none of the established processes apply in the case of LLMs and handwave about “learning” to defend it.

> and how that relates to how people work or “learn” is besides the point

It is important (for the training and generation stages) to distinguish between the model copying the original works and merely inferring information from them, as copyright does not protect against the latter.

> The first part, collation of source material for training is emphatically not unexplored legal or moral territory.

Similarly, in Authors Guild v. Google, Inc., Google internally made entire copies of millions of in-copyright books:

> > While Google makes an unauthorized digital copy of the entire book, it does not reveal that digital copy to the public. The copy is made to enable the search functions to reveal limited, important information about the books. With respect to the search function, Google satisfies the third factor test

Or in the ongoing Thomson Reuters v. Ross Intelligence case, where the latter used the former's legal headnotes for training a language model:

> > verbatim intermediate copying has consistently been upheld as fair use if the copy is "not reveal[ed] to the public."

That it's an internal transient copy is not inherently a free pass, but it is something the courts take into consideration, as mentioned more explicitly in Sega v. Accolade:

> > Accolade, a commercial competitor of Sega, engaged in wholesale copying of Sega's copyrighted code as a preliminary step in the development of a competing product [yet] where the ultimate (as opposed to direct) use is as limited as it was here, the factor is of very little weight

And, given that training a machine learning model is a considerably different purpose from what the images were originally intended for, it's likely to be considered transformative, as in Campbell v. Acuff-Rose Music:

> > The more transformative the new work, the less will be the significance of other factors

Listen, most website and book authors want to be indexed by Google. It brings potential audience their way, so most don’t make use of their _right_ to be de-listed. For these models, there is no plausible benefit to the original creators, and so one has to argue they have _no_ such right to be “de-listed” in order to get any training data currently under copyright.
No one is saying a model is the legal entity. The legal entities are still people and corporations.
Oh come on, you’re being insincere. Whether or not the model is learning from the work just like people is hotly debated as if it would make a difference. Fair use is even brought up. Fair use! Even if it applied, these training sets collate all of everything.

I feel like I’m taking crazy pills TBQH