


	by krono 1323 days ago
	> trained on publicly available code

	Fully respecting the licenses this code was published under, one would hope?

3 comments

googlryas 1323 days ago

What do source code licenses say about using source code as training data? IANAL, but I would imagine it's only relevant if the model spits out already existing licensed code, and that using the code as training data is largely irrelevant.

For a simpler example than code-generating ML: suppose I write a program to tell a directory of source code from non-source code, and my logic is if (unbalanced parenthesis count in all files > X) { return "not source code"; } else { return "source code"; }.

And then I compute X by scanning over the Linux kernel source and counting the number of unbalanced parens. Have I just committed a GPL violation if I don't GPL my source code recognizer?
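
To make the thought experiment concrete, here is a minimal Python sketch of such a recognizer. The function names (`unbalanced_parens`, `count_in_directory`, `classify`) and the exact way of counting "unbalanced" parentheses are assumptions made for illustration; the comment above only gives the one-line rule.

```python
import os


def unbalanced_parens(text: str) -> int:
    """Count parentheses in `text` that have no matching partner."""
    depth = 0      # "(" seen so far that are still waiting for a ")"
    unmatched = 0  # ")" seen while nothing was open
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth:
                depth -= 1
            else:
                unmatched += 1
    # Anything still open at the end is unbalanced too.
    return unmatched + depth


def count_in_directory(root: str) -> int:
    """Sum the unbalanced-paren counts of every file under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                total += unbalanced_parens(fh.read())
    return total


def classify(root: str, x: int) -> str:
    """Apply the rule from the comment: more than X unbalanced parens means not source."""
    return "not source code" if count_in_directory(root) > x else "source code"
```

In this sketch, X would be obtained by calling `count_in_directory` on a checkout of the kernel tree, so the only thing carried over from the GPL-licensed code into the recognizer is a single integer.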


krono 1323 days ago

We're talking about large-scale commercial repurposing of source code with worldwide redistribution here, not some project you whipped up in 5 minutes to learn from or to automate some minor annoying task.

Unlicensed source code - the default - is still protected by copyright law, particularly if it's hosted and served from a different jurisdiction where no exception exists for model training data.

Then there are also licenses that explicitly prohibit commercial usage to consider.

What it comes down to, as it always does, is that a small group of (practically) untouchable people are making money by abusing and thereby irreparably damaging the trust and good will of the collective.

It's a complex topic eh


serf 1323 days ago

I'm not sure if the training data would constitute an aggregation -- given the usually non-reversible nature of it -- but I found this.

"Where's the line between two separate programs, and one program with two parts? This is a legal question, which ultimately judges will decide."[0]

[0]: https://www.gnu.org/licenses/gpl-faq.en.html#MereAggregation


IshKebab 1323 days ago

I don't think anyone seriously thinks that is required. The real issue is that these models can reproduce code they've been trained on, and then you do need to be aware of the license. That would be fine, except that as far as I know none of the existing solutions warn you when the code they've produced is the same as, or very similar to, copyrighted code they learnt it from.

That's the main difference from a human learning from copyrighted code (which is totally legal). If they have a good memory they might be able to reproduce copyrighted snippets, but they would usually (probably not always!) know they are doing that.
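
As a rough illustration of the kind of warning described above, the following Python sketch flags generated code whose token 5-grams mostly also appear in a known snippet. It is not how any existing tool works; the function names, the tokenizer regex, the n-gram size, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def token_ngrams(code: str, n: int = 5) -> set:
    """Split code into crude tokens and return the set of n-token windows."""
    # Identifiers, numbers, or any single non-space character; whitespace is ignored.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated: str, known: str, n: int = 5) -> float:
    """Fraction of the generated code's n-grams that also occur in the known code."""
    gen = token_ngrams(generated, n)
    if not gen:
        return 0.0  # too short to compare
    return len(gen & token_ngrams(known, n)) / len(gen)


def looks_copied(generated: str, known: str, threshold: float = 0.8, n: int = 5) -> bool:
    """Hypothetical warning rule: flag output when most of its n-grams match."""
    return overlap_ratio(generated, known, n) >= threshold
```

A real check would have to compare against an index of the whole training corpus rather than a single snippet, and would need to cope with renamed identifiers; this only shows the basic idea of detecting near-verbatim reproduction.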


swyx 1323 days ago

crickets
