| The article is worth reading, but a good summary is at the bottom: > This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice. But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due. The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether. This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise. |
It sounds like you’re a part of the Copilot team. If so, then I’m happy to see the Copilot team cares about these issues at all. I was expecting nothing but stonewall until the conversation died out, since realistically the chance of the EFF bringing or winning a lawsuit seems small. (And who else would try?)
But when you anger the world and bring so much attention to this delicate issue of copyright in AI, you put every hobbyist at risk. Suppose the world decides that AI models need to be restricted. Now every person who wants to get into AI will need to deal with it. I’m not sure anyone else cares, but I care, because it’s the difference between someone getting into woodworking (an unrestricted hobby) vs becoming a lawyer or doctor (the maximally restrictive hobby). The closer we are to the latter, the fewer ML practitioners we’ll see in the long run. And even though the world will go along fine (it always does), it’d be a sad outcome, since the only way it could happen is if gigantic corporations were flagrantly flying in the face of the spirit of copyright, daring it to punish them.
My point is, please care about the right things. No one cared about language filters on ML models outside of a select vocal group, yet look how deeply OpenAI took those concerns to heart. Everybody cares whether their personal or professional work is being ripped off by an overfitted AI model, and it wasn’t obvious that GitHub or OpenAI gave it more than a passing thought.
Backlinking to the training set should help. But it’s also going to catapult the concern of “holy moly, this code is GPL licensed!” front and center for anyone who works in corporate settings. Gamedev is particularly insular when it comes to GPL, and I can just imagine the conversations at various studios. “This thing might spit out GPL? We can’t use this.”
My point is, when you launch that new feature to address people’s concerns, please ensure it’s working. You won’t be able to do exact string matches against the training set; you can’t rely on “well, it’s slightly different, so it’s not really the same thing.” If it’s substantially similar, it needs to be cited. And that seems like a much tougher problem than merely building an index of matching code fragments.
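To make the exact-match point concrete, here's a rough sketch of one common way to score "substantially similar" code: compare sets of overlapping token runs (shingles) with Jaccard similarity. This is not how Copilot's duplication search works (I have no idea what they use); the function names and the shingle size `k = 3` are assumptions I picked for illustration.

```python
import re

def tokens(code: str) -> list[str]:
    # Identifiers, integer literals, and single punctuation characters.
    # Whitespace is dropped, so reformatting alone changes nothing.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code)

def shingles(code: str, k: int = 3) -> set[tuple[str, ...]]:
    # Every run of k consecutive tokens (one short run if there are fewer than k).
    toks = tokens(code)
    return {tuple(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def jaccard(a: str, b: str, k: int = 3) -> float:
    # Shared shingles divided by all shingles: 1.0 means token-identical.
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):\n  return a+b"
renamed = "def add(x, b):\n    return x + b\n"

print(original == reformatted)         # exact string match misses the copy
print(jaccard(original, reformatted))  # token-level comparison catches it
print(jaccard(original, renamed))      # a rename only lowers the score
```

Even this is too naive for the real thing: renaming one variable drags the score down a lot, so you'd want to normalize identifiers first, and you'd need some kind of index to search a whole training set rather than comparing pairs. Which is sort of my point about it being a tougher problem.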
If you launch it, and it doesn’t work, it’s going to stoke the flames. Careful not to roast.