Hacker News
by mring33621 1309 days ago
1) Starting off, I support AI/ML-based code generation/completion. I would be very happy to see the day when I can figuratively wave my hand and get 80-90% of what I need.

2) It might be fair to allow authors to submit repos to Copilot, along with some sort of 'proof of ownership', in order to exclude them from the training set. There might have to be a documented (agreed-upon?) schedule for 'retraining', so that the exclusion list takes effect in a timely manner.

3) Or just allow authors to add a robots.txt to their repos, which specifies rules for training.
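A rough sketch of what checking such a file could look like, reusing Python's standard robots.txt parser. The "ai-training" user-agent name and the rules in the example file are made up for illustration; nothing like this is an actual Copilot or GitHub convention as far as this thread knows.

```python
from urllib.robotparser import RobotFileParser


def may_train_on(robots_txt: str, path: str, agent: str = "ai-training") -> bool:
    """Return True if the repo's robots.txt lets `agent` use `path`.

    `agent` is a hypothetical crawler name, not a real one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical robots.txt at the root of a repository.
EXAMPLE_ROBOTS = """\
User-agent: ai-training
Disallow: /src/secret/
Allow: /
"""

readme_ok = may_train_on(EXAMPLE_ROBOTS, "/README.md")
secret_ok = may_train_on(EXAMPLE_ROBOTS, "/src/secret/key.py")
```

Here the README would be usable for training and anything under /src/secret/ would not. A repo with no matching rules is treated as allowed, which is exactly the opt-out default some people would object to.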

Just a few thoughts...

2 comments

Pushing the responsibility onto copyright owners rather than GitHub / Microsoft / Copilot seems unreasonable. I'm all for AI being used like this, but it also needs to come with some checks and balances to ensure it's not just regurgitating copyrighted code.
OK, then just use existing copyright licensing:

If a permissive, biz-friendly license (Apache 2.0, maybe others) is found in a given repo, then it can be used in the training set.

Otherwise, the repo cannot be used in a training set.
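As a minimal sketch of that rule, assuming each repo's license has already been detected and reduced to an SPDX identifier (the repo names are invented, and which licenses belong on the allowlist beyond Apache 2.0 is an assumption, not something settled here):

```python
# Assumed allowlist; only Apache-2.0 is named above, the rest are guesses.
PERMISSIVE_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}


def eligible_for_training(spdx_id):
    """A repo qualifies only if a known permissive license was found."""
    return spdx_id in PERMISSIVE_LICENSES


# Hypothetical repos mapped to their detected license (None = no license found).
repos = {
    "alice/widgets": "Apache-2.0",
    "bob/kernel-mod": "GPL-2.0-only",
    "carol/scratch": None,
}

training_set = sorted(
    name for name, spdx_id in repos.items() if eligible_for_training(spdx_id)
)
```

Only alice/widgets makes it into the training set. Repos with no detectable license are excluded by default, which is the "otherwise" case.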

And then every snippet ever created with that trained data would have to include an acknowledgement for every repository included in the training set.

The LICENSE file would be longer than the rest of the code.

(FWIW, I agree with you theoretically, but practically it's hard to get your head around what the ramifications of that would be.)

Many permissive licenses (including Apache 2.0) require attribution.
If Joe Bag’O’Donuts copies and pastes LGPL code into his own personal repository that has an MIT license attached, is it safe for Copilot to train on it?

I’m really of the opinion that MS needs to document the training set and set a high bar for inclusion of additional repos.

Re 2: So a DMCA notice?