by jazzyjackson 1791 days ago
The point is, if they're sure they won't be recycling copyrighted code wholesale, why not include their own in the training set? Surely their internal code is higher quality than the average git repo, which must be 80% abandonware (if my personal repos are anything to go by :P)

Probably because of the (very small) chance that Copilot could regurgitate something secret or embarrassing.

Which is not necessarily hypocritical. The amount of copying needed for something to be copyright infringement is not high… but it's still significantly higher than the amount needed to leak information. For that, just a few words will do, e.g.

    // For Windows 12
or

    // Fuck [company name]
or

    unsigned long long secret_key[2] = {0x1234567812345678, 0x8765432187654321};
And open source codebases don't have code like that?

Not the parts that are secret or embarrassing.