Hacker News
by htpltr 1132 days ago
Antitrust is one thing, but by clean-room implementation standards (one team reads the source and writes a spec, another team writes the code) Copilot is illegal to begin with.

Copilot reads and rearranges the IP that was created by millions of people who were working very hard and did not anticipate a code laundering machine when they wrote the code and the licenses.

4 comments

That's quite an extreme set of statements, and I very much doubt what you consider "illegal" is actually illegal.

When you publish something for others to view (text, images, code, whatever), others are allowed to view it. You can't anticipate how others view it, with their eyes or with screenreaders to assist. You can't stop them from reading it, thinking about it, discussing it with their friends, taking notes, summarizing it. You can't stop people from learning from your published content or recognizing patterns between it and other similar things.

Sorry, but you can't create a license that says "I will allow you to view this but you cannot learn from it. If you learn from it, you need to pay me."

Learning is very different from copying. I can take a movie and convert it to different formats and resolutions. I can use AI algorithms to remove rough edges, and even add color to images that were taken in black and white. None of that would be covered by the word "learning", even if the program takes the movie as input, learns from it, and outputs a work which is completely different from the original.

The words that seem to fit best are transforming and adapting. In order to adapt something, one has to first learn from the original in order to produce the derivative work. This is, however, covered by copyright, since transforming and adapting are still considered a form of copying even if all people did was learn and produce something unique but similar to the original.

The license can say "I will allow you to view this but you cannot create a derivative work from it".

This isn't about a person learning, however. This is about developing an algorithm through the inclusion of GPL-licensed code, one that might emit, and already has emitted, that code verbatim. Those seem materially different to me.

You can copy verbatim, without attribution, the parts of GPL code that are not covered by copyright, such as anything purely functional, like an optimized sorting algorithm.

Copyright is for art. Patents are for utilities and tools.

The art in GPL code is in the arbitrary decisions made about how to structure that code… the class structure and not the algorithms.

You cannot copyright an algorithm, and for very good reason. Imagine if Microsoft had the powers that the GPL is assumed to grant!

Microsoft is not training their code autocomplete only on the parts of GPL/MIT/etc. code that are not covered by copyright. They are training it on the entire codebase.

What part of the codebase are the tools reproducing? The copyrightable aspects of software are generally at the structural level and not at the function level, as most independent functions are utilitarian and not expressive in nature.

If these tools were not context dependent they would not be very useful. These tools aim to only reproduce the non-copyrightable aspects of code and in a context-aware manner.

I have yet to see a case where Copilot has returned code that is something other than the kind of functional, utilitarian code that is explicitly not covered by copyright.

Patents? Perhaps! But that’s another discussion.

If the purpose of processing copyrighted works is to learn the underlying structure and produce further works that are not independently derivative then the courts have a history of judging in favor of fair use.

Copyright is about artistic expression and not functionality.

Clean room is not the actual requirement for avoiding copyright infringement in reverse engineering. There have been several notable cases in which clean room practices were either not followed or outright disregarded, but the resulting product was considered to be non-infringing anyway[0].

Furthermore, while lots of hard work was put into the code that Copilot used, that hard work was specifically donated with the intent that the code be reused, the only hard requirement being that the code remain free. The thing people are angry about with Copilot is that it's a hosted OpenAI product with no freely-available model weights, and that generated code might be regurgitated from training data in some cases[1]. If Copilot were actually open AI, nobody would be suing over it.

[0] In Sony v. Connectix, it was found that Connectix actually tried clean-room, black-box analysis of the PlayStation ROM, but abandoned it in favor of disassembling the whole thing. Connectix was still ruled non-infringing.

[1] Most egregiously, the comment "evil floating point bit level hacking" will make it spit out Quake III source. Microsoft worked around this by explicitly banning that particular phrase, which is just stupid.

Clean room implementations are there to make sure that none of the arbitrary, artistically expressive parts of the code are inadvertently copied.

Class structure, file structure, APIs…

Clean-room implementation is an approach to guarantee a lack of pollution. It is not the minimum level necessary to avoid it.