Stack Overflow Will Charge AI Giants for Training Data (wired.com)
71 points by pd33 1162 days ago
8 comments

So the "AI Giants" that have already trained models using SO / Reddit data will have a perpetual advantage over any newcomers trying to come up. So yeah, totally not a fan of this position from SO / Reddit. Anybody trying to democratize access to foundation models, who isn't (Google|OpenAI|Meta|Microsoft) is now going to find the on-ramp even steeper than ever. As if it wasn't bad enough just paying for compute time.

OTOH, I get why Reddit, SO, etc. would take this position. And I have some sympathy for them in that regard. But the idea of locking in the centralization of powerful AI models is, to me, a bigger problem than Reddit or SO optimizing their profit margin by a percentage point or two.

Perhaps, but Stack Overflow answers have a shelf life. Who cares how to fix an obscure React 3.0 issue these days?
One of my main uses of AI right now is answering legacy questions. Searching for old AngularJS (not Angular) questions online is painful, yet ChatGPT is able to come up with explanations, code snippets and even outright provide code based on prompts.
Per https://stackoverflow.com/help/licensing:

    As noted in the Stack Exchange Terms of Service and in the footer of every page, all publicly accessible user 
    contributions are licensed under Creative Commons Attribution-ShareAlike license as follows:

    Content contributed before 2011-04-08 (UTC) is distributed under the terms of CC BY-SA 2.5.
    Content contributed from 2011-04-08 up to but not including 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 3.0.
    Content contributed on or after 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 4.0.
What does this mean in simple words? As one of many Stack Overflow contributors (although a very small one), I don't want my answers walled off by the SO website. They are free to use the advertising revenue from traffic generated by my content to pay for creating and supporting the platform, but the content itself is mine. Why shouldn't it be that way?
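In code terms, the date ranges on that licensing page boil down to a simple lookup. A minimal sketch in Python, assuming a hypothetical helper named `license_for` (nothing Stack Overflow itself provides) and a contribution date already expressed in UTC:

```python
from datetime import date

# Cutoff dates copied from the licensing page quoted above.
CC_BY_SA_3_0_START = date(2011, 4, 8)
CC_BY_SA_4_0_START = date(2018, 5, 2)


def license_for(contributed_on: date) -> str:
    """Hypothetical helper: map a contribution date (UTC) to its CC BY-SA version."""
    if contributed_on < CC_BY_SA_3_0_START:
        return "CC BY-SA 2.5"
    if contributed_on < CC_BY_SA_4_0_START:
        return "CC BY-SA 3.0"
    return "CC BY-SA 4.0"


if __name__ == "__main__":
    for day in (date(2010, 1, 1), date(2015, 6, 1), date(2020, 1, 1)):
        print(day.isoformat(), license_for(day))
```

This only tells you which license version applies to a given post. It says nothing about what attribution or share-alike would require of a model trained on that post, which is the part people are arguing about.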
Meanwhile I was considering an addendum to my GPL licenses to read something like, "use of this code is strictly limited to humans. under no circumstances may an artificial agent use this for training. Any evidence of sufficient similarity in an AI output will be construed as a violation of this license".

Obviously it won't work, but I do wonder if this is a direction GPL4 should be looking into ...

The data on Stack Overflow servers provided by users isn't really Stack Overflow's data.
"You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit…" https://stackoverflow.com/legal/terms-of-service
Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers. So what he’s actually proposing is very unclear.

> When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

> Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers.

Yes, because it's a license violation — "If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original". That includes derived data products, like AI models, built using the content.

That seems rather debatable. While I don't think the overall use case favors fair use, due to the commercial nature of most of the end products, the fact that such use is clearly transformative is definitely a positive factor on the side of LLM creators:

> A key consideration in later fair use cases is the extent to which the use is transformative. In the 1994 decision Campbell v. Acuff-Rose Music Inc,[13] the U.S. Supreme Court held that when the purpose of the use is transformative, this makes the first factor more likely to favor fair use.[14] Before the Campbell decision, federal Judge Pierre Leval argued that transformativeness is central to the fair use analysis in his 1990 article, Toward a Fair Use Standard.[11] Blanch v. Koons is another example of a fair use case that focused on transformativeness. In 2006, Jeff Koons used a photograph taken by commercial photographer Andrea Blanch in a collage painting.[15] Koons appropriated a central portion of an advertisement she had been commissioned to shoot for a magazine. Koons prevailed in part because his use was found transformative under the first fair use factor.

https://en.wikipedia.org/wiki/Fair_use

…yet in the same article he’s talking about selling the data to LLM developers.

It’s hard to make sense of.

SO content is dual licensed to SO, giving them the right to "commercially exploit" it. That means they can relicense under terms that also allow commercial exploitation.

https://stackoverflow.com/legal/terms-of-service

right? like, as a top 500 user, do i get a say in this?
Why should you? It's their platform.

EDIT: Someone just posted a snippet of their ToS, seems like it's not your data.

because i wrote the content, and now they're trying to say what others can or can't do with it. but sure, i agreed to some terms.

those terms are CC BY-SA 4.0 (https://stackoverflow.com/help/licensing), which says others can "remix, transform, and build upon the material for any purpose, even commercially."

I'd think harvesting for AI training data falls under that. the "attribution" and "share alike" clauses make it kinda tricky though.

those terms are kind of brutal actually, I'm not sure if anyone is following them. you have to credit every answer you use in your code? and share your new totally different code?

Right, but SO did provide a place for it to live and that’s worth something.
Yeah, but not the right to the data.
Ironic.

SO content is user generated. It is a bit rich for them to put a wall around it and claim that it is chargeable for use as an AI training set by other companies (unless they also have a plan to share the income with their users).

Especially so given their stance on cash bounties for answers, sponsored questions, etc.

The mods on SO and the Stack Exchange family of sites frown upon cash bounties for answers. But when SO itself wants to erect a paywall around the question-and-answer set for AI training, it is somehow clean and moral.

Smells like BS.

https://meta.stackoverflow.com/questions/251576/how-open-is-...

https://meta.stackexchange.com/questions/25615/offering-actu...

https://meta.stackexchange.com/questions/57850/pay-money-to-...

Contemporary AI systems are now becoming human-competitive at general tasks,[3] and we must ask ourselves: Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization? Such decisions must not be delegated to unelected tech leaders. Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable. This confidence must be well justified and increase with the magnitude of a system's potential effects. OpenAI's recent statement regarding artificial general intelligence, states that "At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models." We agree. That point is now.
“Old man shouting at the clouds”

It's not easy to stop technology with regulation. At this point companies in many countries have started building their LLMs on the whole internet.

Just upgrade yourself)

This is how it should be. You should not get to train on other people's data for free. It sucks that it has to be SO who puts its foot down, but at least this is a step in the right direction. This way, the balance of power is more in content creators' hands, and this avoids the "death of original content creation" people keep complaining about. If you a) have to pay for your training data or b) have to generate your own data, then that incentivizes people to not charge cheaply, and good data from intellectuals, artists, writers, etc. becomes valuable and not something that can just be taken for free.
i don't think SO is going to compensate its contributors who actually provided the answers though
Yes, that's the "unfortunate part", but it's a wasted opportunity: the artists raging against AI haven't figured out that this is a way to get what they want.
I don't think that's what "artists raging against AI" want at all.
I understand the business motivation, especially because this is likely existential for SO, but it feels perverse to take information that people gave to help others and try to paywall it.
Agreed. But that’s the same mantra of any big company like Twitter, Facebook, etc. They all monetize UGC in some form or the other.
Yes and no. My first thought was "... and, what, pay royalties to the contributors?" But the thing that the ML community has been doing is wholesale IP laundering and somebody needs to take them to task for that.
Isn’t this exactly what OpenAI did? Took SO’s data and put it behind a paywall, called it “AI”, it’s brilliant.
Paywalled for AI not for people doing Q&A.