Hacker News
by JustLurking2022 1258 days ago
It'll be interesting to see the legal battles around this. For humans, a clean room approach is often used when reimplementing an API, because having seen the original implementation makes for an easy infringement case (even if the copying was accidental). If you ask ChatGPT to reimplement an API from a project it's indexed, it seems likely to infringe to some extent every time.

Clean room design is there to ensure that the parts of the code that are expressive are not copied… the class structures and overall “shape”, the sort of arbitrary differences between different ways to organize data structures, functions and types (think: what everyone argues the most about).

If some functionality can basically only be expressed in one general way, then that expression is not subject to copyright. That could be due to performance constraints, simplicity, or the code being a direct expression of a mathematical algorithm.

It is incredibly hard to read through a codebase and not be inspired by the structure. “Oh, I like how this code was organized!”… it is hard to “forget” that! That’s the real reason for clean room design: to keep people from inadvertently reusing the “expressive, non-utilitarian structures” that ARE covered by copyright, in works that are otherwise utilitarian and whose functional side is the domain of patents.

Yes, and if you ask an AI to reimplement functionality for which it has previously seen an exact implementation, it will likely spit out something that looks heavily influenced by, if not identical to, the original. That's the thing about ML - it's still just trying to make the result match the query and doesn't really "care" about the originality of the response.

First, these code language models are best at the utilitarian aspects of programming and worst at the expressive aspects. At this point Copilot is going to lead you down pretty awful paths when it comes to organizing and structuring your code.

Second, when these language models improve to the point where they can assist in organizing and structuring code, this is where a given individual will be most prone to disagreeing with the model and having individual preferences for data types and whatnot!

Third, language models work best when they copy all works indiscriminately. Copyright is an individual right granted to a human author. Infringement is against a specific human author's rights in their specific individual works, not against all human authors and their unrealized collective works! The courts might find that the language models themselves do not infringe, but that individuals who USE these tools can indeed be found to infringe specific works! This would put the liability entirely on the user of the tools. I would expect it to be very useful for these tools to notify users when their output likely runs into some copyright claim, via a sort of "fingerprinting" of the abstract individual expressions of code structuring. This would probably expand on the growing notion of prior art in copyright and expose a lot of coincidental, "clean" patterns that should not be covered by copyright because of how frequently people express things in that manner... like the Factory Pattern or something like that.
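
For what it's worth, a toy version of that "fingerprinting" idea might look something like the Python below. This is only a sketch under assumptions: the names `shape` and `structural_fingerprint` are made up for illustration, it only handles Python source, and hashing the syntax-tree shape is just one possible stand-in for "abstract structure". It is not how any existing tool works and says nothing about what a court would count as copying.

```python
import ast
import hashlib


def shape(node: ast.AST) -> str:
    """Render a syntax tree as nested node-type names, dropping identifiers and literals."""
    children = ",".join(shape(child) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


def structural_fingerprint(source: str) -> str:
    """Hypothetical helper: hash the structure of a snippet, ignoring naming choices."""
    tree = ast.parse(source)
    return hashlib.sha256(shape(tree).encode("utf-8")).hexdigest()


# Two snippets that differ only in names share a fingerprint.
original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"
print(structural_fingerprint(original) == structural_fingerprint(renamed))  # True
```

A tool along these lines could count how often a given fingerprint shows up across many unrelated projects; a structure that appears everywhere (the Factory Pattern case) would be a poor candidate for an infringement warning, while a rare one matching a single project might be worth flagging to the user.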