by googlryas 1321 days ago
What do source code licenses say about using source code as training data? IANAL, but I would imagine it's only relevant if the model spits out already existing licensed code, and that using the code as training data is largely irrelevant.

For a simpler example than code-generating ML: suppose I write a program to recognize a directory of source code vs. non-source code, and my logic is if (unbalanced parenthesis count in all files > X) { return "not source code"; } else { return "source code"; }

If I then compute X by scanning the Linux kernel source and counting the number of unbalanced parens, have I just committed a GPL violation if I don't GPL my source code recognizer?
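
For concreteness, here is a minimal Python sketch of the recognizer described above. The function names (unbalanced_parens, count_in_directory, classify) are made up for illustration, and it assumes parentheses are counted per file and then summed, with no attempt to skip string literals or comments.

```python
import os


def unbalanced_parens(text: str) -> int:
    """Count parentheses in text that have no matching partner."""
    depth = 0      # '(' seen so far that are still waiting for a ')'
    unmatched = 0  # ')' seen with no open '(' before them
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth > 0:
                depth -= 1
            else:
                unmatched += 1
    # Whatever is still open at the end is unbalanced too.
    return unmatched + depth


def count_in_directory(root: str) -> int:
    """Sum the per-file unbalanced paren counts under root (assumption: per-file counting)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += unbalanced_parens(fh.read())
            except OSError:
                continue  # unreadable file: skip it
    return total


def classify(root: str, x: int) -> str:
    """Apply the threshold rule from the post: more than x unbalanced parens means not source."""
    if count_in_directory(root) > x:
        return "not source code"
    return "source code"
```

Under these assumptions, X would just be the result of count_in_directory on a local checkout of the kernel tree, i.e. a single integer derived from the GPL'd code.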

2 comments

We're talking large-scale commercial repurposing of source code with worldwide redistribution here. Not some project you whipped up in 5 minutes to learn from, or to automate some minor annoying task.

Unlicensed source code - the default - is still protected by copyright law, even if it's hosted and served from a different jurisdiction, one where no exception exists for training models on it.

Then there are also licenses that explicitly prohibit commercial usage to consider.

What it comes down to, as it always does, is that a small group of (practically) untouchable people are making money by abusing and thereby irreparably damaging the trust and good will of the collective.

It's a complex topic, eh.

I'm not sure if the training data would constitute an aggregation -- given the usually non-reversible nature of it -- but I found this:

"Where's the line between two separate programs, and one program with two parts? This is a legal question, which ultimately judges will decide." [0]

[0]: https://www.gnu.org/licenses/gpl-faq.en.html#MereAggregation