by mwigdahl 1344 days ago
Butterick sneakily asserts over and over that Copilot is simply retrieving code from GitHub ("Copilot's whizzy code-retrieval methods", "Copilot is merely a convenient alternative interface to a large corpus of open-source code", "our work is stashed in a big code library in the sky called Copilot"). This wording seems specifically chosen to present a misleading picture of what Copilot is and does.

Copilot is a set of trained weight values in a matrix. There is no source code stored in that matrix. The fact that someone can prompt Copilot with specifically chosen text to generate a short sequence of code that matches a corresponding segment of code used to train the model does not mean that it is somehow "just retrieving" that snippet. It is _generating_ that code, guided by the weight matrix, via pattern-matching based on the chosen textual prompt and surrounding context.
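
To make the retrieve-versus-generate point concrete, here is a toy sketch. It is not Copilot's architecture (Copilot is a large neural network, and this is a tiny next-token count table); the names `train`, `generate`, and `weights` are made up for illustration. It only shows that a model which holds learned next-token statistics, and no lookup of files, can still emit a training sequence token by token when given the right prompt.

```python
from collections import Counter, defaultdict

def train(tokens, order=2):
    """Learn next-token counts for every context of `order` tokens."""
    weights = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        weights[context][tokens[i + order]] += 1
    return weights

def generate(weights, prompt, max_new=20, order=2):
    """Greedily extend `prompt` one token at a time using the learned counts."""
    out = list(prompt)
    for _ in range(max_new):
        context = tuple(out[-order:])
        if context not in weights:
            break  # the model has never seen this context
        # Pick the most frequent next token for this context.
        out.append(weights[context].most_common(1)[0][0])
    return out

# Hypothetical one-line "training set", already split into tokens.
corpus = "def add ( a , b ) : return a + b".split()
weights = train(corpus)

# A well-chosen prompt makes the model walk back through the training sequence.
result = generate(weights, ["def", "add"])
print(" ".join(result))
```

With a training set this small, the prompt `def add` leads the toy model back through the whole line it was trained on, one predicted token at a time. Whether that counts as "retrieval" is exactly the question being argued; the sketch only illustrates the mechanism.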

That distinction is significant because one of the primary defenses against a copyright infringement claim in US law is fair use, and a key question there is whether the derived work is transformative. Copilot is a work derived in part from GitHub code, but it has unique capabilities far beyond returning short snippets of input code, and the work itself is clearly an extensive transformation of the input data.

This is without even considering whether concrete _outputs_ of the model that happen to match code in a repository used to train it are themselves protected by copyright, which is another issue entirely (and not as cut and dried as many folks on here seem to think).

1 comment

Correct. He's written a great opening argument, as long as you're the sort of person who likes speeches. To me it was full of tricks to prime the reader into accepting his premises as axiomatic, from nuanced rhetoric to pull quotes with attractive color gradients. In my view his actual motivation is the typical 30% cut of any class action settlement that goes to the lawyers, and he sees a lucrative opportunity to combine two skillsets.