HN Mirror



	by avmich 1162 days ago
	The data on Stack Overflow servers provided by users isn't really Stack Overflow's data.

3 comments

CharlesW 1162 days ago

"You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit…" https://stackoverflow.com/legal/terms-of-service

disntthinkthis 1162 days ago

Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers. So what he’s actually proposing is very unclear.

> When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

CharlesW 1162 days ago

> Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers.

Yes, because it's a license violation — "If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original". That includes derived data products, like AI models, built using the content.

Paul-Craft 1162 days ago

That seems rather debatable. While I don't think the overall use case favors fair use, due to the commercial nature of most of the end products, the fact that such use is clearly transformative is definitely a positive factor on the side of LLM creators:

> A key consideration in later fair use cases is the extent to which the use is transformative. In the 1994 decision Campbell v. Acuff-Rose Music Inc,[13] the U.S. Supreme Court held that when the purpose of the use is transformative, this makes the first factor more likely to favor fair use.[14] Before the Campbell decision, federal Judge Pierre Leval argued that transformativeness is central to the fair use analysis in his 1990 article, Toward a Fair Use Standard.[11] Blanch v. Koons is another example of a fair use case that focused on transformativeness. In 2006, Jeff Koons used a photograph taken by commercial photographer Andrea Blanch in a collage painting.[15] Koons appropriated a central portion of an advertisement she had been commissioned to shoot for a magazine. Koons prevailed in part because his use was found transformative under the first fair use factor.

https://en.wikipedia.org/wiki/Fair_use

disntthinkthis 1162 days ago

…yet in the same article he’s talking about selling the data to LLM developers.

It’s hard to make sense of.

Paul-Craft 1162 days ago

SO content is dual licensed to SO, giving them the right to "commercially exploit" it. That means they can relicense under terms that also allow commercial exploitation.

https://stackoverflow.com/legal/terms-of-service

8n4vidtmkvmk 1162 days ago

right? like, as a top 500 user, do i get a say in this?

okdood64 1161 days ago

Why should you? It's their platform.

EDIT: Someone just posted a snippet of their ToS, seems like it's not your data.

8n4vidtmkvmk 1160 days ago

because i wrote the content, and now they're trying to say what others can or can't do with it. but sure, i agreed to some terms.

those terms are CC BY-SA 4.0 (https://stackoverflow.com/help/licensing), which says others can "remix, transform, and build upon the material for any purpose, even commercially."

I'd think harvesting for AI training data falls under that. the "attribution" and "share alike" clauses make it kinda tricky though.

those terms are kind of brutal actually, I'm not sure if anyone is following them. you have to credit every answer you use in your code? and share your new totally different code?

kordlessagain 1162 days ago

Right, but SO did provide a place for it to live and that’s worth something.

avmich 1162 days ago

Yeah, but not the right to the data.
