by VBprogrammer 1309 days ago
Pushing the responsibility onto copyright owners rather than GitHub / Microsoft / Copilot seems unreasonable. I'm all for AI being used like this, but it also needs to come with some checks and balances to ensure it's not just regurgitating copyrighted code.
1 comments

OK, then just use existing copyright licensing:

If a permissive, biz-friendly license (Apache 2.0, maybe others) is found in a given repo, then it can be used in the training set

Otherwise, the repo cannot be used in a training set
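
To make that rule concrete, here's a rough sketch of the filter. Everything in it is made up for illustration: the repo names, the use of SPDX-style license IDs, and the allowlist itself (which licenses count as "permissive" is a policy call, not something settled here).

```python
# Hypothetical allowlist; the proposal names Apache 2.0 and "maybe others".
PERMISSIVE_LICENSES = {"Apache-2.0"}


def eligible_for_training(repos):
    """Return the names of repos whose declared license is on the allowlist.

    `repos` maps a repo name to its declared license ID, or None if no
    license was found.
    """
    return [
        name
        for name, license_id in repos.items()
        if license_id in PERMISSIVE_LICENSES
    ]


# Made-up example data.
repos = {
    "alice/parser": "Apache-2.0",
    "bob/scheduler": "LGPL-3.0-only",
    "carol/notes": None,  # no LICENSE file found
}
print(eligible_for_training(repos))
```

Note this only looks at the license a repo declares. It can't tell whether the code inside actually matches that license.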

And then every snippet ever generated from that training data would have to include an acknowledgement for every repository included in the training set.

The LICENSE file would be longer than the rest of the code.

(FWIW, I agree with you in theory, but in practice it's hard to get your head around what the ramifications would be)

Many permissive licenses (including Apache 2.0) require attribution.
If Joe Bag’O’Donuts copies and pastes LGPL code into his own personal repository that has MIT license attached, is it safe for Copilot to train on it?

I’m really of the opinion that MS needs to document the training set and include a high bar for inclusion of additional repos.