by thebrid 657 days ago

As much as I love the Internet Archive, is it really that crazy? The four factors used for determining fair use are:

  * the purpose and character of the use;
  * the nature of the copyrighted work;
  * the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  * the effect of the use upon the potential market for or value of the copyrighted work.

In the Internet Archive case, they're distributing whole, unmodified copies of copyrighted works which will of course compete with those original works.

In the AI use case, they're typically aiming not to output any significant part of the training data. So they could well argue that the use is transformative, reproducing only minimal parts of the original work and not competing in the market with the original work.

haswell 657 days ago

To me, the point isn’t that what the IA was doing was fair use, but that what LLMs are doing arguably is not.

> In the AI use case, they're typically aiming not to output any significant part of the training data

What they’ve aimed to do and what they’ve done are two different things. Models absolutely have produced output that closely mirrors data they were trained on.

> not competing in the market with the original work

This seems like a stretch, if only because I already see how much LLMs have changed my own behavior.

These models exist because of that data, and directly compete by making it unnecessary to seek out the original information to begin with.

halJordan 656 days ago

But look at your own argument. LLMs are not fair use because they might be prompted into regurgitating something substantially similar to the trained data.

And yet, the IA is 100% aiming to absolutely reproduce literally every part of the work in a 100% complete manner that replaces the original use of the work.

And you cannot bring yourself to admit that the IA is wrong. When you get to that point, you have to admit to yourself that you're not making an argument; you're pushing a dogma.

haswell 656 days ago

I’m not arguing that the IA is right or wrong here.

The point, more generally, is to highlight an asymmetry in how people are thinking about these issues.

If it turns out, after various lawsuits shake out, that LLMs as they currently exist are actually entirely legal, there’s a case to be made that the criteria for establishing fair use are quite broken. In a world where the IA gets in legal trouble for interpreting existing rules too broadly, it seems entirely unjust that LLM companies would get off scot-free for doing something arguably far worse from some perspectives.

codedokode 656 days ago

IA was lending digital copies (only one user at a time could read a given book). It was acting like a library lending out physical books, except that it did so over the Internet, which is more convenient. IA is a non-profit.

What publishers argue is that you cannot treat digital books like physical ones; i.e. you cannot re-sell or lend (like IA did) a digital book.

What LLMs do is use copyrighted content for profit; they do not lend anything.

JoshTriplett 656 days ago

> and not competing in the market with the original work

AI absolutely competes in the market with the original works it trains on, and with new works in those same markets. Proponents of unrestricted AI training loudly tout and celebrate that it does so.

Which would be fine, if everyone else had the same rights to completely ignore copyright. The asymmetry here seems critically broken.

hiatus 657 days ago

> In the Internet Archive case, they're distributing whole, unmodified copies of copyrighted works which will of course compete with those original works.

Libraries would be illegal if conceived of today. If this weren't digital it would be a violation of first sale doctrine.

tptacek 656 days ago

How? Libraries lend out actual physical objects. They're not xeroxing the books and handing them out.

hiatus 656 days ago

The actual opinion rules on the concept of controlled digital lending more broadly. From page two:

> "This appeal presents the following question: is it “fair use” for a nonprofit organization to scan copyright-protected print books in their entirety and distribute those digital copies online, in full, for free, subject to a one-to-one owned-to-loaned ratio between its print copies and the digital copies it makes available at any given time, all without authorization from the copyright-holding publishers or authors? Applying the relevant provisions of the Copyright Act as well as binding Supreme Court and Second Circuit precedent, we conclude the answer is no."

tcgv 656 days ago

Exactly. And if a book is in high demand in a library, you'd either have to wait your turn or purchase one yourself to avoid the lending queue.

cool_dude85 656 days ago

The IA's controlled digital lending setup worked the same way.

tptacek 656 days ago

No, the IA's CDL system required them to make multiple copies of books (one to digitize the book, and one for every reader of the book), which is not a legal problem a physical library runs into.

cool_dude85 656 days ago

I agree, and apparently this distinction is legally relevant. However, it does not change my point that the CDL also has the property that:

"if a book is in high demand in a library, you'd either have to wait your turn or purchase one yourself to avoid the lending queue."
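
To make the queueing property concrete, here is a minimal sketch of a one-to-one owned-to-loaned rule with a waitlist. This is not the IA's actual implementation; the class name `ControlledLendingTitle` and its methods are hypothetical, and the sketch deliberately ignores the copying question raised above.

```python
from collections import deque


class ControlledLendingTitle:
    """Hypothetical model of one title under a one-to-one owned-to-loaned rule."""

    def __init__(self, owned_copies):
        self.owned_copies = owned_copies  # print copies the lender owns
        self.on_loan = set()              # readers currently holding a loan
        self.waitlist = deque()           # readers waiting, first come first served

    def borrow(self, reader):
        """Return True if the reader holds a loan, False if they were queued."""
        if reader in self.on_loan:
            return True
        if len(self.on_loan) < self.owned_copies:
            self.on_loan.add(reader)
            return True
        if reader not in self.waitlist:
            self.waitlist.append(reader)
        return False

    def give_back(self, reader):
        """End a loan and pass the freed copy to the next reader in the queue."""
        self.on_loan.discard(reader)
        if self.waitlist and len(self.on_loan) < self.owned_copies:
            self.on_loan.add(self.waitlist.popleft())


title = ControlledLendingTitle(owned_copies=1)
first = title.borrow("alice")   # gets the only copy
second = title.borrow("bob")    # has to wait
title.give_back("alice")        # bob is next in line
```

With a single owned copy, the second reader waits until the first returns the loan, which is the same trade-off a physical library imposes.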

beardyw 656 days ago

> Libraries would be illegal if conceived of today.

Just shows how far forward we have progressed. Maybe book burnings next to prevent resale?

aezart 656 days ago

I don't understand how AI companies can claim that they're not aiming to output the training data when the loss function is "how well can the model memorize the dataset?".
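
For context, language models are typically trained with a next-token cross-entropy loss. The following is a minimal sketch with made-up probabilities over a hypothetical three-token vocabulary; the function name `next_token_loss` and all values are assumptions for illustration, not any particular company's training code.

```python
import math


def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood assigned to the true next tokens."""
    total = 0.0
    for probs, target in zip(probs_per_step, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


# Hypothetical predictions at two positions of one training sequence.
targets = [0, 1]  # indices of the tokens that actually come next
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

loss_confident = next_token_loss(confident, targets)
loss_uniform = next_token_loss(uniform, targets)
print(loss_confident < loss_uniform)
```

The loss falls as the model assigns more probability to the exact continuation found in the training text. Whether that amounts to memorization in practice depends on things this sketch does not model, such as dataset size, deduplication, and how often each passage is seen.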
