by HomeDeLaPot 1748 days ago
Merely hosting your code publicly seems like it wouldn't give GitHub the right to train AI models on it. You could even say it's against your terms of use. And to do it, they would have to go out of their way to find your repo on the web and clone it—unlikely.

My impression (NOT A LAWYER) is that by hosting your code in a public repo on GitHub, you agree to their terms and give them the right to "read" your code including training AI models on it. Or at least that's what they're banking on.

Go host on Sourcehut or self-host with Gitea, and I would think it unlikely (but not impossible) that any big company would use your code to train their AI.

3 comments

It's not even clear whether training an AI on OSS code is a violation of those licenses. So unless you publish your code under a proprietary license that explicitly rejects such use, you can't really prevent people from doing that anyway.

Just imagine: there's really nothing preventing people from scraping your blog to train their natural language processing AI or whatever, so why would code be any different? Even if you put up a big sign saying you don't consent to having your data ingested by a neural network, I doubt it will get noticed anyway...

People have long been using large OSS codebases (e.g. the Linux kernel) for various statistical analyses. AI is just doing the same thing in a more sophisticated manner.

I bet if I trained an AI on some vocalist and released an album I'd get some legal mayhem. I do concede it might go differently for code, but none of these issues are crystal clear to me.
For the same reason, I wish it were easier to convince the projects I like and want to help to migrate. Committing to their repos, including mere mirrors, does not put me in the clear.
I would think that training a NN falls squarely in fair use.