by IncRnd
1817 days ago
From https://docs.github.com/en/github/copilot/research-recitatio...: "Once, GitHub Copilot suggested starting an empty file with something it had even seen more than a whopping 700,000 different times during training -- that was the GNU General Public License." On the same page is an image showing copilot adding, in real time, the text of the famous Python poem, The Zen of Python. See https://docs.github.com/assets/images/help/copilot/resources... for a direct link to the image of copilot doing this. You are making arguments about what you read instead of objectively observing how copilot operates. Just because GH wrote that copilot synthesizes new code doesn't mean that it writes new code in the way that a human writes code. That is not what is happening here. It is replicating code. Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.
Of course I am. We are both participating in a speculative discussion of how copyright law should handle ML code synthesis. I think this is really clear from the context. It also seems obvious to me that this product will not be able to move beyond the technical preview stage if it continues to make a habit of copying distinctive code and comments verbatim, so that scenario isn't really interesting to me. GitHub seems to agree (from the page on recitation that you linked):
> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
> But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
> The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
> This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
The arguments you've made here would seem to apply equally well to a version of Copilot hardened against "recitation", hence my reply.
> Even in the best case copilot is creating derivative works from code where GH is not the copyright owner.
It would be convenient for your argument(s) if it were settled legal fact that ML-synthesized code is derivative work, but it seems far from obvious to me (in fact, I would disagree) and you haven't articulated a real argument to that effect yourself. It has also definitely not been decided by any court capable of establishing precedent.
And, again, if this is what you believe then I'm not sure how the work of human programmers is supposed to be any different in the eyes of copyright law.