"You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing terms (CC BY-SA 4.0), and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit…" https://stackoverflow.com/legal/terms-of-service
Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers. So what he’s actually proposing is very unclear.
> When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.
> Except that Stack Overflow’s CEO, in this very article, says that it’s a violation of the Creative Commons license to train an LLM on their answers.
Yes, because it's a license violation — "If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original". That includes derived data products, like AI models, built using the content.
That seems rather debatable. While I don't think the overall use case favors fair use, due to the commercial nature of most of the end products, the fact that such use is clearly transformative is definitely a positive factor on the side of LLM creators:
> A key consideration in later fair use cases is the extent to which the use is transformative. In the 1994 decision Campbell v. Acuff-Rose Music Inc,[13] the U.S. Supreme Court held that when the purpose of the use is transformative, this makes the first factor more likely to favor fair use.[14] Before the Campbell decision, federal Judge Pierre Leval argued that transformativeness is central to the fair use analysis in his 1990 article, Toward a Fair Use Standard.[11] Blanch v. Koons is another example of a fair use case that focused on transformativeness. In 2006, Jeff Koons used a photograph taken by commercial photographer Andrea Blanch in a collage painting.[15] Koons appropriated a central portion of an advertisement she had been commissioned to shoot for a magazine. Koons prevailed in part because his use was found transformative under the first fair use factor.
SO content is dual licensed to SO, giving them the right to "commercially exploit" it. That means they can relicense under terms that also allow commercial exploitation.
because i wrote the content, and now they're trying to say what others can or can't do with it. but sure, i agreed to some terms.
those terms are CC BY-SA 4.0 (https://stackoverflow.com/help/licensing), which says others can "remix, transform, and build upon the material
for any purpose, even commercially."
I'd think harvesting for AI training data falls under that. the "attribution" and "share alike" clauses make it kinda tricky though.
those terms are kind of brutal actually, I'm not sure if anyone is following them. you have to credit every answer you use in your code? and share your new totally different code?