by scarface_74 1091 days ago
You don’t get to make information publicly available. But not publicly available. If you want your blog to be restricted, put it behind a login

Yes I do. I own the work I create, even if it's publicly available. I do get to decide what happens with it.
> I do get to decide what happens with it.

No. Both legally and practically, you absolutely do not.

The only thing copyright law gives you is an exclusive right to sell it for a limited period of time, as a whole in its original form or similar -- and to transfer that right.

Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.

And you don't actually get to decide any of that.

* Edit: added "or..."

So you’re saying I’m right except in some narrowly carved-out situations. And I agree with you.
Nope. You said:

> I wasn't asked and I don't really care to donate work to large corporations like that... I do get to decide what happens with it.

And I said:

> No. Both legally and practically, you absolutely do not.

You think you get to decide whether large corporations can train on your work. I'm saying the law suggests you very much don't get to decide that.

Read the comments you're replying to. I didn't comment on the legality of ChatGPT training on my content; I said I didn't like it. Regardless, the act of posting content publicly does not mean I give up my copyright claim. Yes, there are fair use situations. Training ChatGPT might be one of them, but I'm not seeing a lot of concrete information one way or the other, and I am seeing arguments that ChatGPT could be considered a derivative work, which would place OpenAI in violation of my copyright.

Send some links if you see some definitive case law sorting this stuff out.

You are claiming that piracy is legal.
Anyone can read your blog and then post their own blog post using knowledge they learned while reading yours. ChatGPT "learned" from your blog that same way
Since the way GPT "learns" is not materially similar to how a human learns, I don't see why this talking point is particularly relevant. Nothing stops the courts from distinguishing between an AI and a human with regard to what may be permissible.
I agree, it seems like all the arguments that the use of data by AI should have no more restrictions than the use of data by humans hinge on the implicit (or sometimes explicit) assumption that human learning and machine learning are identical. While there are parallels, there also seem to be significant differences not only in how the learning is done, but also in outcomes for the person whose data is being used. And since a major purpose of IP, copyright, etc. is at least ostensibly to protect the creators of information from negative outcomes, I don't think the outcomes can be ignored when comparing human learning to ML.
Anthropomorphizing that it "learned" is disingenuous and I expect better from the HN crowd.

If ChatGPT regurgitates, verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?

A human is capable both of reciting things from memory in an infringing manner and of learning from experiences to create something new. Maybe we should tape people's mouths shut if they dare to violate copyright by reciting a copyrighted book word for word, or put them in a straitjacket if they recreate a copyrighted painting from memory.
Actually I fear that people that say this are doing worse than anthropomorphizing.

Often rather than claiming human aspects to the machine, they are going further, and claiming machine aspects to the human.

Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.

That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.

Like with everything in law, "intent" is paramount. Obviously it's neither the trainer's nor the end user's goal to reproduce training set data verbatim; quite the contrary, overfitting as such is undesirable.
Intent only goes so far. If I continually but unintentionally reproduce copyrighted works verbatim, I could still face consequences, particularly if I did not show due diligence in preventing it from happening in the first place.
But ChatGPT doesn’t spit out verbatim from the blog.
Computers aren't people. Software isn't humans.
There is a difference between learning from your work and copying your work.

You are entitled to control its distribution and use. You are not entitled to control its influence and effects.

I think you've made up an irrelevant argument. The work has been incorporated into a commercial product, intentionally, under the control of someone else. Software isn't humans that pay taxes, appear in court, have rights, etc.
No, the work has not been. The impression that the work leaves on a neural network has been though.

AIs are not massive repositories of harvested data. The models are relatively small (<20GB).

A resized, smaller, or encoded version of an image is still subject to copyright. Calling an encoding an 'impression' is deceitful.
Not always.

https://www.pinsentmasons.com/out-law/news/google-thumbnails...

> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.

It's none of those things. These models train on petabytes of data. They store relationships of objects to each other, not the objects themselves.
Actually, people have been successfully sued for plagiarizing other works because they had internalized them and accidentally regurgitated them. So the fact that content runs through a human brain doesn’t necessarily cleanse it of copyright concerns.
There is no "actually" because you are still addressing distribution. It wouldn't be hard to have another AI that analyzes outputs for copyright infringement and culls them as necessary.

Would that satisfy you?
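
To make the idea of an output filter concrete, here is a minimal sketch. It is a much cruder stand-in than "another AI": it only flags outputs that share a long run of consecutive words with a known source text. The function names and the 8-word threshold are assumptions for illustration, not anything described in the thread or known about how ChatGPT works.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of all runs of n consecutive lowercase words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copies_verbatim(output: str, source: str, n: int = 8) -> bool:
    """True if output shares at least one n-word run with source.

    The threshold n=8 is an arbitrary illustrative choice.
    """
    return bool(word_ngrams(output, n) & word_ngrams(source, n))
```

A check like this only catches near-verbatim copying; paraphrase or close imitation would pass straight through, so it says nothing about the harder legal questions in this thread.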

To some extent. Others can ingest your work, quote it, talk about it, criticize it, summarize, etc.
If I read your blog and used its data along with my own knowledge to create a course, would that be plagiarism or copyright violation?
>You don’t get to make information publicly available. But not publicly available.

But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.

Sure, blog posts are unlicensed (as far as I know), but the idea of something publicly available being held to restrictions is nothing new.

Do you allow commercial employees to read the code and incorporate knowledge obtained from the code into their brains?
This is a fantastic point. I can legally go pick up any strictly copyrighted book at a store and read parts of it for free, which I will then have learnt and have in my brain to share with anyone else. If I happen to have a superintelligent brain I can potentially gain a lot more and make a lot more inferences from this one outing, and consequently add a lot of value for others I share my info with.

But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.

Copyright just doesn't protect such cases. There's a funny exaggeration that is very illustrative: copyright protects the bugs in the code. I.e. the specific way in which code was written. Reading it and getting inspired was never meant to break copyright.

What protects particular solutions is patents. For example if someone were to obtain a patent for computing GCD of large integers the usual fast way, well then everyone else would have to use a different solution.
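
For context, "the usual fast way" is presumably the Euclidean algorithm (an assumption; the comment doesn't name it). A minimal Python sketch, with `gcd_euclid` as an illustrative name:

```python
def gcd_euclid(a: int, b: int) -> int:
    """Greatest common divisor via the Euclidean algorithm."""
    a, b = abs(a), abs(b)
    while b:
        # Replace (a, b) with (b, a mod b); the GCD is unchanged by this step.
        a, b = b, a % b
    return a


# Python integers are arbitrary precision, so large inputs work unchanged.
print(gcd_euclid(2**64, 6**20))  # prints 1048576, i.e. 2**20
```

Copyright would cover this particular text of the function; the hypothetical patent in the comment would cover the method itself, however it is written.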

This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.

If you go read a book, memorize it, write it down later in a substantively similar form, and share it freely or sell it — yes, you might get into copyright trouble. It has happened before and it is at best a tricky gray area.

If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.

It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.

Everything about ChatGPT seems to be a gray area, which is probably why we are where we are.
> You have a different set of rights than ChatGPT.

Gods, no. Where did you get that from?

Are you a human being? A citizen of some country? If so you definitely have a different set of rights than ChatGPT.

Those might not be a problem regarding this specific case, but the case can easily be made that it ought to be.

Show me where ChatGPT's brain is and your comparison will become relevant.
I mean in the floating point / quantized numbers and the connections that make the model? I'm not sure I follow, the analogy to the human brain has always been obvious, it's even in the name (artificial neural network) ...
The analogy is just that: an analogy, and a very imperfect, misleading one. The working of the brain may have motivated early research, but GPT (as instantiated in hardware) does not operate or learn in a way similar to a human brain.
Yes, it's completely unfeasible to make a license to control that.

On the other hand, it's completely feasible to make a license that stops someone from training their model with some piece of info, is it not?

Why is it that people keep on flogging dead horses?
That's not how copyright works.
Another day, another person on HN showing us how they don't understand the difference between Public Domain and Open Source or Copyleft etc.

And regardless -- the problem now is that expectations of how content can be consumed are now fundamentally violated by automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limitations on the speed and scale on which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.

Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.

Different game now.

> People put stuff up on the Internet with the expectation of its consumption by human minds

Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.

I think it's a bit intellectually dishonest to claim an equivalence between content indexing for search engines and machine learning for LLMs. They might share an underlying harvesting technique, but their uses -- indexing for information accessibility vs automatic content production -- are qualitatively different.

Further, almost every site has had something like a robots.txt, which permits content harvesting only for certain accepted purposes, for a couple of decades now. So clearly people already had a sense of how they wanted their content harvested and for what purposes.
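
As a sketch of what that looks like in practice: the robots.txt below admits one crawler and turns away another, and Python's standard `urllib.robotparser` evaluates it. The crawler names `ExampleSearchBot` and `ExampleTrainingBot`, the helper `may_fetch`, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow a search crawler, disallow a training crawler.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, url: str) -> bool:
    """Ask the parsed rules whether this user agent may fetch this URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("ExampleSearchBot", "https://example.com/blog/post"))
print(may_fetch("ExampleTrainingBot", "https://example.com/blog/post"))
```

Note that robots.txt is a convention that crawlers choose to honor, not an access control or a license, so it expresses the site owner's wishes without enforcing them.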

How is it not content production when I search for something on Google and get a box that lists similar questions and summarizes the answer?

So you’re okay with Google making money off of your content. But not OpenAI?