by deadbabe 531 days ago
Fair use.
3 comments

*only available in the USA, terms and conditions apply.

Most other places use fair dealing, which is more restrictive: https://en.m.wikipedia.org/wiki/Fair_dealing

Easy to claim, harder to justify once you start charging money for your subsequent creation.

Unless all LLMs are a ruthless parody of human intelligence, which they may be, the legal issues will continue.

The moment you earn money from it, that's not fair use anymore. When I last checked, unlimited access to said models was not free, plus it's not "research" anymore.

- Addenda -

For the interested parties, the law states the following [0].

Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:

    1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
    2. the nature of the copyrighted work;
    3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
    4. the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors

So, if you say that these factors can be flexed depending on the defendant, and can be just waved away to protect the wealthy, then it becomes something else, but given these factors, and how damaging this "fair use" is, I can certainly say that training AI models with copyrighted corpus is not fair use in any way.

Of course, at the end of the day, IANAL & IANAJ. However, my moral compass directly bars the use of copyrighted corpora in publicly accessible, for-profit models which deprive many people of their livelihoods.

From my perspective, people can whitewash AI training as they see fit to sleep sound at night, but this doesn't change anything from my PoV.

[0]: https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors

I really don't think it's that simple. I can read books and then earn money from applying what I learned in them. I can also study art and then make original art in the same or similar styles. If a person was doing this there would be no one claiming copyright infringement. The only difference is it's a machine doing it and not a person.

The nature of copyright and plagiarism boils down to paraphrasing, and so long as LLMs sufficiently paraphrase the content it's an open question whether it's copyright infringement and requires new law/precedent.

So the fact they are earning money is a red herring unless they are reproducing the exact same content without paraphrasing (with exception to commentary). E.g. they can quote part of a work while commenting on it.

Where they have gotten into trouble with e.g. NYT afaik is when the LLM reproduced a whole article word for word. I think they have all tried hard to prevent the LLM from ever doing that to avoid that legal risk.

> I can read books and then earn money from applying what I learned in them.

How many books can you read, understand and memorize in time T, and how many books can an AI ingest in the same time T?

If we're down to paraphrasing, watch this video [1], and think again.

Many models, given that you ask the correct questions, reproduce their training set with great accuracy, and this is only prevented with monkey patching, IIUC.

So, it's still a big mess, even if we don't add copyrighted corpus to the mix. Oh, BTW, datasets like "The Stack" are not clean as they claim. I have seen at least two non-permissively licensed code repositories inside that dataset.

[1]: https://youtu.be/LrkAORPiaEA

I agree it's a big mess, that was kind of my point.

I am curious about the video, but am not compelled to spend 24 min watching it when you haven't summarized its thesis for me. The title of the video makes it seem adjacent at best to the points I was making. (Some automated flagging system =/= actual law)

"Making money" does not immediately invalidate fair use, but it does wave a big red flag in the courts' faces.

I would be more nuanced on this matter. As I understand it, in the US, fair use allows media to write critiques of cultural artefacts (sorry, I cannot think of a better, broad term). For example, you can include small quotes from a film script when writing a critique of it without requiring permission from the copyright owner. And, until the World Wide Web reached the masses in the mid-1990s, most critiques were published by commercial media outlets, such as daily newspapers. They were certainly published by commercial, for-profit entities. That said, I think the intent of the fair use is very important to the courts, much more than the entity that is doing the fair use (newspaper, blogger, etc.).

Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.

> Another weird carve-out for copyright law in the US: parody. Honestly, I don't know if other jurisdictions allow parody in the same protected manner.

Germany: https://www.gesetze-im-internet.de/urhg/__51a.html (This explicit carve-out is a recent development, though generally speaking parodies were allowed even under the previous version of the law.)

Your reference (link) is very impressive. Thank you for sharing. Honestly, I would struggle to provide the equivalent for US federal law (or a court ruling). Are you a lawyer in DACH/Germany? How did you know to find this web page?

> Are you a lawyer in DACH/Germany?

Nope :-), just a normal citizen, but sometimes I am curious enough to look up a law, plus sometimes I need to refer to/look up some law in my day job as a civil engineer, too.

When you need to do that, it's not too hard to stumble upon that page through some web searches, plus the German Wikipedia often links to it, too. It also links to some alternative platforms run by private entities, which sometimes provide added value: buzer.de provides change history since 2006, other pages link relevant court decisions, and so on. But gesetze-im-internet.de is the official page run by the federal government itself.

So you're saying that every law is a suggestion, depending on who's being tried?

Er, what? I'm speaking directly from the law, 17 U.S.C. § 107. It's deliberately written in terms of "factors to consider", rather than absolutes.

> In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:

> * the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

> * the nature of the copyrighted work;

> * the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

> * the effect of the use upon the potential market for or value of the copyrighted work.

You can absolutely monetize works altered under fair use.

Any examples sans current AI models? I have not seen any, or failed to find any, to be precise.

Basically any YouTube video that shows another YouTube video, song, movie, etc. as part of something else (e.g. a voiceover).