Hacker News
by rrwo 1228 days ago
No, it isn't.

If I publish something on the internet, I am only making it available for people to read.

That does not give anyone the right to use it in other ways.


There is a thing called "fair use" that lets people do more than just read your copyrighted works.

https://www.copyright.gov/help/faq/faq-fairuse.html

We make use of this exception on HN all the time, like this, from the link above:

> How much of someone else's work can I use without getting permission?

> Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. There are no legal rules permitting the use of a specific number of words, a certain number of musical notes, or percentage of a work. Whether a particular use qualifies as fair use depends on all the circumstances.

When the entire work is synthesized for the purpose of training a model, it's not "limited portions".

It makes sense there is no limit on the number of words that can be used under fair use, but it's certainly less than all of them.

Context matters. You are taking one FAQ question and generalizing. It is permissible to reproduce or show complete works in education, depending on the context. It is also completely normal to read copyrighted work in private, learn something from it, and then answer questions about that work publicly.

The questions around LLMs learning from copyrighted material are still open and need to be settled in court. I personally imagine finding infringement would impose more harm on society and progress than letting the models acquire knowledge from these copyrighted works.

> letting the models acquire knowledge from these copyrighted works

I'm gagging at the nonsensical anthropomorphizing being done to end-run the fact that what the LLMs are doing is copying.

You make a lot of condescending or toxic remarks on HN. You might want to consider how that affects your ability to sway others with your comments.

Please chill

> the fact that what the LLMs are doing is copying.

I disagree. The training process creates token representations and weighted connections between them. The models later produce probabilistic token sequences, not so unlike what our meat bodies do, though by very different mechanisms. The fact that certain sequences can be reproduced verbatim is likely a consequence of overfitting. They certainly cannot reproduce all training data verbatim. It would be interesting to know what determines which sequences can and cannot be reproduced, and how.
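A toy sketch of what I mean, using a bigram counter as a stand-in. This is nothing like a real LLM, and the corpus and function names are made up for illustration. When a token has only ever been seen with one continuation, sampling returns the training text verbatim; when there are several, the output varies.

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which token follows which. The counts stand in for learned weights."""
    counts = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_tokens, rng):
    """Sample one token at a time, in proportion to the observed counts."""
    out = [start]
    while len(out) < max_tokens and out[-1] in counts:
        followers = counts[out[-1]]
        tokens = list(followers)
        weights = [followers[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return " ".join(out)

rng = random.Random(0)

# One training sentence: every token has exactly one continuation,
# so the "model" can only give the sentence back verbatim.
memorized = train_bigram(["the quick brown fox jumps"])
print(generate(memorized, "the", 10, rng))

# Several sentences sharing tokens: the output is one of a few possible sequences.
varied = train_bigram(["the cat sat", "the dog ran", "the cat ran"])
print(generate(varied, "the", 3, rng))
```

The first call can only print the training sentence; the second prints whichever of the possible three-token sequences the sampler happens to pick.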

> The models later produce probabilistic token sequences, not so unlike what our meat bodies do, though by very different mechanisms.

Your response to me calling out your baseless anthropomorphizing was to double down on it? It's amusing to me that you don't think you are condescending.

But that's also not how copyright works. At least in the United States, the protections offered by copyright center around reproduction, performance, and derivative works[1].

If the AI models are reproducing copyrighted works, then that's a problem. And it does look like there are some examples where that might be happening beyond notions of fair use. But slurping up copyrighted content to train a model seems to fall under allowed use.

[1] https://www.copyright.gov/what-is-copyright/

US law only applies to the US, and what is legal is not the same as what is moral.

As for fair use, this is not the same as someone remixing or sampling songs, or writing fan fiction or satire or quoting works for criticism.

People do not want models trained on their creative works, so that someone else can make money using those models to produce similar creative works as a service for third parties.

While it is possible to create similar creative works (and I will grant that that could be a prima facie problem), it is also possible to make rather new creative works as well. Just like you can do by hand, you can interpolate and extrapolate from known starting points, and there is nothing stopping you from coming up with something totally unique.

Training a model maybe, but is it clear that the output of the model isn't a derivative work?

To about the same degree as the output of a human.

I just started writing a new novel. It's an interesting, in my opinion highly novel fantasy/SF(ish) story, for once not fanfiction of anything that's still in copyright -- most people wouldn't count stories based on ancient Norse mythology as 'fanfiction' -- but that doesn't mean it isn't derivative. It means, instead of naming two or three things it's derivative of, I can name ten to fifteen.

That's normal. All stories are derivative, and if you point me at an author who claims theirs aren't, you're pointing at a liar. The job of an author is to put the building blocks together in a new and interesting form, not to make them up from whole cloth. It's impossible to invent more than two or three truly novel ideas per day, even if you're incredibly imaginative, and most of those won't be any good.

The difference between humans and AIs, nowadays, seems to be that the AIs use millions of sources instead of ten to fifteen. Or, alternately, that they use none -- and theirs is less derivative -- because certainly everything I've ever read goes into my writing, not just the things I recognise I'm using.

>To about the same degree as the output of a human.

No. Full stop. Humans aren't stochastic parrots. Pointing to a lack of understanding about what exactly happens in the human mind is, FULL STOP, not evidence that LLMs are doing the same things humans do.

This being HN, I get to be pedantic ;-).

Humans are not stochastic, they're obviously chaotic[1]. Which is to say: not parrots at all.

Some of the modern models I've seen seem to be chaotic too, though, so that's interesting [2]. I'm going to assume LLMs probably exhibit the same properties.

[1] https://en.wikipedia.org/wiki/Chaos_theory (Chaotic systems sometimes seem to be stochastic, but they're actually much stranger and more interesting!)

[2] I've been messing with stable diffusion to get a feel for (and/or avoid) tipping points: that is to say, points in latent space where the model becomes very sensitive to small changes in initial parameters. You can find instances fairly quickly, even by hand, by doing a bisection search.
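A rough sketch of that bisection, assuming you have two parameter values whose outputs clearly differ. `toy_model` and its threshold are hypothetical stand-ins for actually rendering an image and judging it by eye; nothing here calls stable diffusion.

```python
def toy_model(t):
    """Hypothetical stand-in for generating an image at parameter t and
    judging which of two looks it has. The hidden threshold is made up."""
    return "A" if t < 0.3728 else "B"

def find_tipping_point(model, lo, hi, tol=1e-6):
    """Bisect between two inputs with different outputs until they are within tol."""
    out_lo, out_hi = model(lo), model(hi)
    if out_lo == out_hi:
        raise ValueError("endpoints must produce different outputs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(mid) == out_lo:
            lo = mid  # still looks like the low end, so the flip is further up
        else:
            hi = mid  # already flipped, so the flip is further down
    return lo, hi

lo, hi = find_tipping_point(toy_model, 0.0, 1.0)
print(lo, hi)  # a tiny interval that brackets the point where the output flips
```

With a real model there may be many such boundaries between the two endpoints; bisection only narrows in on one of them, which is enough when all you want is to find an instance.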

>[2]. I'm going to assume LLMs probably exhibit the same properties.

That's quite an assumption to make.

I don't think I ever claimed that?

That's not my argument. My argument is that the anti-AI arguments, as spoken, also match what I know I'm doing as a human. In my opinion they match that better than they match what the AIs are doing, because, as you say, they aren't human.

Maybe the output isn't, but what the LLM turns the work into when it becomes a constituent element of the model is probably a derivative work.

For clarity, the term is about rights to make copies.

It in no way blocks the right to absorb, understand, or build upon a work for the next idea.