What do source code licenses say about using source code as training data? IANAL, but I would imagine the license is only relevant if the model spits out already existing licensed code, and that using the code as training data is largely irrelevant.
For a simpler example than code-generating ML: suppose I write a program to recognize a directory of source code vs non-source code, and my logic is if (unbalanced parenthesis count in all files > X) { return "not source code"; } else { return "source code"; }.
If I then compute X by scanning over the Linux kernel source and counting the number of unbalanced parens, have I just committed a GPL violation if I don't GPL my source code recognizer?
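To make the thought experiment concrete, here is a minimal sketch of such a recognizer. The function names and the kernel path in the usage note are made up for illustration, and the counting rule (every paren without a partner counts once) is just one reasonable reading of "unbalanced".

```python
import os


def unbalanced_parens(text: str) -> int:
    """Count parentheses in text that have no matching partner."""
    open_count = 0  # '(' seen so far that are still waiting for a ')'
    unmatched_close = 0  # ')' seen with no '(' available to close
    for ch in text:
        if ch == "(":
            open_count += 1
        elif ch == ")":
            if open_count > 0:
                open_count -= 1
            else:
                unmatched_close += 1
    return unmatched_close + open_count


def count_in_directory(root: str) -> int:
    """Sum the unbalanced-paren count over every readable file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += unbalanced_parens(fh.read())
            except OSError:
                continue  # skip files we cannot open
    return total


def classify(root: str, threshold: int) -> str:
    """Apply the rule from the comment: more than X unbalanced parens means not source."""
    if count_in_directory(root) > threshold:
        return "not source code"
    return "source code"
```

Computing X would then be a single call such as `X = count_in_directory("/path/to/linux")` (hypothetical path). The shipped recognizer contains only that one integer derived from the kernel tree, which is what makes the licensing question interesting.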
We're talking large-scale commercial repurposing of source code with worldwide redistribution here. Not some project you whipped up in 5 minutes to learn from, or to automate some minor annoying task.
Unlicensed source code - the default - is still protected by copyright law. That matters all the more if it's hosted and served from a different jurisdiction where no exception exists for training models on it.
Then there are also licenses that explicitly prohibit commercial usage to consider.
What it comes down to, as it always does, is that a small group of (practically) untouchable people are making money by abusing and thereby irreparably damaging the trust and good will of the collective.
I don't think anyone seriously thinks that is required. The real issue is that these models can reproduce code they've been trained on, and then you do need to be aware of the license. That would be fine, except as far as I know none of the existing solutions warn you when the code they've produced is the same as, or very similar to, copyrighted code they learnt it from.
That's the main difference from a human learning from copyrighted code (which is totally legal). If they have a good memory they might be able to reproduce copyrighted snippets, but they would usually (probably not always!) know they are doing that.
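The kind of warning described above could look roughly like the following. This is only a sketch under loud assumptions: `known_snippets` is a hypothetical lookup of licensed code keyed by a label, the 0.9 threshold is arbitrary, and a real tool would need an index over a huge corpus rather than a pairwise comparison with `difflib`.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences don't hide a match."""
    return " ".join(code.split())


def most_similar(generated: str, known_snippets: dict) -> tuple:
    """Return (label, score) of the known snippet closest to the generated code."""
    best_label, best_score = None, 0.0
    target = normalize(generated)
    for label, snippet in known_snippets.items():
        score = SequenceMatcher(None, target, normalize(snippet)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def license_warning(generated: str, known_snippets: dict, threshold: float = 0.9):
    """Return a warning string if the output is near-identical to known code, else None."""
    label, score = most_similar(generated, known_snippets)
    if label is not None and score >= threshold:
        return f"warning: output is {score:.0%} similar to {label}"
    return None
```

A character-level ratio like this only catches near-verbatim copies. Renamed variables or reordered statements would slip through, which is part of why this is harder for a tool than for a human who remembers where a snippet came from.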