by googlryas
1321 days ago
What do source code licenses say about using source code as training data? IANAL, but I would imagine it's only relevant if the model spits out already existing licensed code, and that using the code as training data is largely irrelevant. For a simpler example than code-generating ML, suppose I write a program to distinguish a directory of source code from non-source code, and my logic is if (unbalanced parenthesis count in all files > X) { return "not source code"; } else { return "source code"; }. If I then compute X by scanning the Linux kernel source and counting the number of unbalanced parens, have I just committed a GPL violation if I don't GPL my source code recognizer?
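To make the thought experiment concrete, here is a rough Python sketch of that recognizer. The function names, the per-file summing, and the directory paths in the usage note are all assumptions for illustration; the only thing taken from the comment is the rule "unbalanced paren count above X means not source code".

```python
import os


def unbalanced_paren_count(text: str) -> int:
    """Count parentheses in text that have no matching partner."""
    open_depth = 0        # '(' seen so far that are still waiting for a ')'
    unmatched_close = 0   # ')' seen with no '(' left to pair with
    for ch in text:
        if ch == "(":
            open_depth += 1
        elif ch == ")":
            if open_depth > 0:
                open_depth -= 1
            else:
                unmatched_close += 1
    return open_depth + unmatched_close


def count_in_directory(root: str) -> int:
    """Sum the per-file unbalanced paren counts under root (assumed aggregation)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total += unbalanced_paren_count(f.read())
            except OSError:
                continue  # skip unreadable files
    return total


def classify(root: str, threshold: int) -> str:
    """Apply the rule from the comment: more than threshold means not source code."""
    if count_in_directory(root) > threshold:
        return "not source code"
    return "source code"
```

Under this sketch, X would come from something like count_in_directory("linux/") (a hypothetical checkout path), and the shipped recognizer would carry only that single integer, not any kernel text.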
|
Unlicensed source code, which is the default, is still protected by copyright law. That matters if it's hosted and served from a different jurisdiction where no exception exists for training models on it.
Then there are also licenses that explicitly prohibit commercial usage to consider.
What it comes down to, as it always does, is that a small group of (practically) untouchable people are making money by abusing and thereby irreparably damaging the trust and good will of the collective.
It's a complex topic, eh.