Hacker News
by probably_wrong 1010 days ago
Random thought: my blog is licensed under a Creative Commons license [1] that allows you to use and transform my content as long as you give attribution and distribute your contributions under the same terms.

I found the OpenAI bot scraping my blog recently. Assuming they used that data, when will they attribute me?

[1] https://creativecommons.org/licenses/by-sa/4.0/

9 comments

Because these AI companies do not comply with the licenses on code, I haven't contributed a single line of open source software, nor released any of my projects that way, since Microsoft released their code generator. I removed a bunch of projects a while ago and I will likely remove all of them when I get around to it. I have been fixing bugs and releasing open source projects for decades and I just stopped the moment they did that. Open source is dead to me if the licenses can't be enforced.
Your license doesn't override copyright law.

Given that Google successfully used a fair use defense in Authors Guild, Inc. v. Google, Inc., I think it's likely OpenAI and the others will also win in court.

I do think it's possible for specific uses of the output of LLMs to be copyright infringement. That's why it's interesting to see Microsoft indemnify customers of their commercial products in the event a case is brought against the customer. This is smart on Microsoft's part; the risk probably isn't very high, and by making it a non-issue for their customers, many more will feel comfortable using their LLM-based features and services.

Well it all comes down to whether training an LLM is fair use or not. I think it is likely that courts rule it is transformative enough that training is allowed regardless of what terms you have for the use of content.
Interesting question, continuing on this, since they probably used GPL-3 code with the Affero clause, do they have to open source GPT? (The Affero clause is I believe the more directly applicable license thingy, though CC by-sa should also work.)

https://www.gnu.org/licenses/agpl-3.0.en.html

I think the whole code license question does not matter much, because the code is data input, not a part of their actual program.

It's like how GitHub's servers host AGPL code as data without having to be open source.

The perceived problem there is if their model generates an exact copy of some AGPL code, you use it in your project unknowingly, and then you can get sued.

How recent? Because ChatGPT always repeats the same mantra, that its training data only goes up to September 2021 with no updates... even for ChatGPT-4.
Where are your attributions for all the words written in your comment? ;P You remixed the words and grammar patterns from other people's Creative Commons-licensed writings!

Note: I'm declaring my comment license as https://creativecommons.org/licenses/by-sa/4.0/

So if you remix or transform my comment by responding to it, please attribute me in your response.

Humans are not LLMs trained and operated by a company for profit. Your argument is that LLMs hold all the same basic rights as humans but they hold (and should hold) exactly none.
That might be the implied argument, but the explicit argument appears to be that licensing a grouping of words as a work, and then declaring any use of any of those words or letters in any order to be a transformation of that work without any other context or evidence of transformation, is silly. We could rephrase the OP's post as:

>I found [logs of users from Paramount's writers offices reading] my blog recently. Assuming they used that data, when will they attribute me?

Put that way, the idea is silly on its face. OP has no evidence that any of their work was used at all, or even that what was used could be covered under the license in the first place.

which works, specifically?

profit vs non profit also makes a difference

I believe that OpenAI is not required to attribute you if the output was produced by an OpenAI-operated AI model because the AI is not constrained by the Berne Convention treaty regime in the same way that people are.

I believe that this fact is and will be exploited to strip copyright and effectively transfer ownership using cleanroom/firewall techniques.

training will be ruled fair use which doesn't require any license, while there is no lawsuit on the output
> Assuming they used that data...

That's the key part. You haven't yet proved they have actually used your content for anything (other than, potentially, reading the license to decide whether to include it in or discard it from their training set).

But in practice we'll never know for sure if they are respecting the terms of licenses until 1) this is tested in court, or 2) there's some internal leak that points in either direction.

I expect that OpenAI would concede that they used the data in any court case immediately to get that issue off the table, I really don't think they have a strong interest in foot-dragging on this stuff, right?

I would think OpenAI wants the thornier legal issues actually settled so that the whole ecosystem can grow within those terms & they can lobby for the legal changes they need/want?

The alternative would be discovery on that issue, which they may want to avoid.
> wants the thornier legal issues actually settled

.. wants the thornier issues to be debated and re-tried ad infinitum, as long as they generate cash flow and build their moat(s).. more likely

https://the-decoder.com/openai-apparently-going-all-in-on-ch...

This behaviour seems more consistent with wanting it sorted out than with stalling for time.

a motion to dismiss is "going all in" ?