Hacker News
by pydry 1779 days ago
I'm kind of wondering if this controversy might not end up being a storm in a teacup.

From what I've seen, copilot really lowers the barrier to writing buggy code. If it does indeed turn out to be a tool that lends itself to machine-gunning rather than shooting yourself in the foot, it almost doesn't matter who owns what IP.

The relentless attempts at developer commodification will, of course, continue, but I can already sense this one ending up like the developer outsourcing craze of the mid-2000s that the Economist also got a little too excited about.


If the code ends up being non-IP infringing on the code used to train it, that would be a big win for open source / free software.

Now you can just grab any leaked code of a closed source program, feed it into your AI and get back code you can license under the GPL and nobody can do anything about it.

An easy application I can think of is ZFS; simply feed the AI all CDDL licensed code, then ask it to reproduce ZFS. Probably will have some bugs but it would be licensable under GPL if the AI is considered a whiteroom.

I think you’re missing that the law considers intent. If the devs of copilot were not trying to set up infringement, then their algorithm’s output is likely not considered infringement [1]. However, if you set out to “launder” copyrighted material then the law will take that into consideration and likely find that you violated copyright. This intent can be demonstrated in court either via your statements, or your actions (such as constructing a meaninglessly tiny training set).

[1]: https://ilr.law.uiowa.edu/print/volume-101-issue-2/copyright...

Would it not be categorically intended as infringement regardless of the copyright status of the material?

It seems to me that the licensing part is the part you can't throw into a big markov chain, legally. Even if they aimed only at open-source licensed material without exception, the point where they discard all the licenses and export a 'generic' slurry is the point where they infringe by definition. If they trained on more restrictive licenses that's just doubling down: what's needed is annotation and maintenance of what bits of code came from what licensing pool. You could well have a giant pool of GPL, a giant pool of MIT (which I would be in, all the more since I maintain a very automatable code style that's easy to import from). You could accumulate a list of sources for anything you did, at whatever level of granularity is desired.
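
A rough sketch of the kind of bookkeeping described above, assuming each training item can be tagged with where it came from. The `Snippet` type, the `build_pools` and `attribution` helpers, and the `example/...` source names are all hypothetical, made up for illustration; nothing here reflects how any real tool is built.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """One training item with its provenance kept alongside the code."""
    code: str
    source: str   # hypothetical identifier, e.g. a repository name
    license: str  # license tag, e.g. "MIT" or "GPL-3.0"


def build_pools(snippets):
    """Group snippets into per-license pools instead of one mixed corpus."""
    pools = defaultdict(list)
    for snippet in snippets:
        pools[snippet.license].append(snippet)
    return dict(pools)


def attribution(used):
    """List the distinct (source, license) pairs behind a piece of output."""
    return sorted({(s.source, s.license) for s in used})


corpus = [
    Snippet("def add(a, b): return a + b", "example/mathlib", "MIT"),
    Snippet("def sub(a, b): return a - b", "example/calc", "GPL-3.0"),
    Snippet("def mul(a, b): return a * b", "example/mathlib", "MIT"),
]
pools = build_pools(corpus)
print(sorted(pools))
print(attribution(pools["MIT"]))
```

Keeping the pools separate means a model trained on only one of them has a single license to carry forward, and `attribution` can be applied at whatever granularity is wanted.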

The purpose of throwing away this attribution is intent to infringe. It's constructing a machine for the explicit purpose of grinding code into sludge of intentionally small enough pieces that, if you reconstruct copyrighted code in your markov-chainy way, you've got grounds for pretending you didn't build your machine to do exactly that.

> you've got grounds for pretending you didn't build your machine to do exactly that.

I believe all laws about intent have to deal with determining who is pretending and who isn't. But these laws still exist, because there are ways to prove such things.

I don't think it's that simple. A tiny training set would obviously defeat the point, but the other part is that the AI can't commit copyright infringement, and I don't have to ask it to produce anything. I merely fed it copyrighted code and released it to other developers without documenting that fact. I could possibly open source the entire bot, as no part of the AI would be under the restrictions of the training set.
Again, the law isn’t enforced by robots and is able to adapt such that “clever legal hacks” don’t typically work. Us programming nerds tend to think in terms of rigid, unambiguous rules that treat inputs as black boxes, but the law does not work like this.
I'm well aware but I don't think this is an issue here.
It's exactly the issue.

If the AI could be shown to have copied the code, it would likely be found to be infringement.

If it was found to have generated new, unique code, and merely learnt how to program from the code it was trained on, it likely wouldn't.

In either case, this is different to a clean-room implementation (which I think is what you meant by "white room").

Clean-room implementations are supposed to protect against trade secret infringement, and are mostly used when building interop with hardware (where compatibility has special carve-outs).

If a person or AI had seen copyrighted code used in the project, it would never be considered clean room.

But CDDL code is fine for a person or AI to learn from when building a new, incompatible implementation that doesn't share any code.

If you hire a programmer who has worked on said closed source and ask them to recreate that code in your GPL licensed program - at what point will that be considered infringing on the original? Can you relicense code if you feed it through a programmer?
A programmer is a living breathing human, a legal entity that can hold copyright and is capable of violating copyright.

An AI cannot hold copyright however and isn't capable of violating copyright (legal entities are, which an AI is not).

It’s called a transpiler; you don’t need AI for this, but it’s obviously still licensed same as original - because it is the original, only translated.
I'm not talking about a transpiler, I'm talking about feeding massive amounts of non-GPL code into an AI and then asking it to produce new code based on that. A transpiler would simply take a single codebase and translate it into a new format, the obvious difference being that such a tool has an obvious and introspectable transformation function.
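
To show what an "obvious and introspectable transformation function" looks like in practice, here is a toy sketch using Python's standard `ast` module. It translates a small arithmetic subset of Python into Lisp-style s-expressions; the `OPS` table and the `to_sexpr` and `transpile` names are made up for this example.

```python
import ast

# Fixed, inspectable mapping from Python operator nodes to Lisp-style symbols.
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def to_sexpr(node):
    """Translate a small arithmetic subset of Python into an s-expression."""
    if isinstance(node, ast.Expression):
        return to_sexpr(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = to_sexpr(node.left)
        right = to_sexpr(node.right)
        return f"({OPS[type(node.op)]} {left} {right})"
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return repr(node.value)
    if isinstance(node, ast.Name):
        return node.id
    raise ValueError(f"unsupported syntax: {ast.dump(node)}")


def transpile(source):
    """Parse one Python expression and emit the equivalent s-expression."""
    return to_sexpr(ast.parse(source, mode="eval"))


print(transpile("1 + 2 * x"))  # (+ 1 (* 2 x))
```

Every output token can be traced back to one rule and one input node, which is exactly what you can't do with a model trained on a large corpus.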
If the resulting code works the same way, it’s still a transpiler. If the resulting code works in a different way… then I have to ask what exactly does it do and how’s that supposed to be useful.
There should be no reason for CDDL to not be includable under GPL license. Canonical does include it for example.
This would eliminate any ambiguity, ZFS would simply be GPL now.
If you replaced CDDL with GPL you’d lose patent protection. Good luck with Oracle lawyers.
Most of the relevant ZFS patents have already expired, so I don't think there is anything for the Oracle lawyers. Plus I live in a country where software patents aren't recognized, so double good luck to Oracle.
On some level yes. This is basically shifting closed repos to a trade secret status.
>From what I've seen copilot really lowers the barrier to writing buggy code

There is no barrier to writing buggy code. Writing buggy code is considered trivial in any language.

I'd estimate about half of all programming is throwing up barriers to writing buggy code.

There's a constant tension between building fast and right (or should be if you're not fucking up).

Is that true with functional languages? I can see a class of buggy code that can be very hard to write. But maybe I'm not seeing the whole picture.
Sure. Regardless of how wacky your definition of "functional" gets, it is possible and relatively easy to write bugs in Python, Scheme, Haskell, or OCaml; all of these languages confuse `x + y` and `x - y`. Idris, Agda, or Coq can catch that mistake, but still suffer "Boolean blindness" and other traditional problems.
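
A small illustration of both points, written in Python with type hints. The function names are invented for the example, and the Boolean case is only one loose reading of "Boolean blindness": a bare `bool` that no longer says what it stands for.

```python
def total_price(price: int, tax: int) -> int:
    """Intended to return price plus tax."""
    return price - tax  # Bug: should be `+`; the types are identical either way.


def set_visibility(name: str, hidden: bool) -> str:
    """A bare bool carries no meaning at the call site."""
    return f"{name}: {'hidden' if hidden else 'visible'}"


# Both calls are well-typed, yet the first returns a wrong total and the
# second is easy to misread: does True mean "show" or "hide"?
print(total_price(100, 8))
print(set_visibility("sidebar", True))
```

An ordinary type checker accepts all of this, since `+` and `-` share a signature and `True` is a perfectly valid `bool`.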

There are plenty of bug classes which are trivial in any language; plan interference is a good example. Languages provably cannot avoid these bugs entirely, just make them less easy.

I’ve used it/do use it and it helps to fill out obvious stuff - it didn’t make me much quicker.

The part that takes the longest is working out the tests and what the code should do, the actual internals of the implementation are simple, boring, and obvious.

Automate that and it makes developing even more fun than it is today.

I tend to find if it's that obvious you're probably already using a library.

Or, if you're not, you should be.

But, if copilot instead suggests just writing out the contents of the library directly into your code base a lot of people will do just that. That'll be lots of fun when you're trying to track down obscure bugs in huge piles of murky "copilot assisted" code.

It'll be especially bad in environments where developers feel either extrinsic or intrinsic pressure to always write more SLOC and churn out more PRs because it will allow developers to create a very compelling illusion of productivity.

I have a feeling this will be one of the long term side effects of copilot. I'm actually suspicious that this dynamic will blow away all of the productivity gains and then some and might lead to companies banning its use when they realize the true costs of sifting through the GPT spew.

I think we are using “obvious” in a different way. I mean that if I want to write an if statement or something else that is easy to write, it does it for me.
Writing an if statement takes a couple of seconds anyway though, doesn't it?
I run through this point with other developers a lot. There are hard technical problems out there but a great deal of difficulty in programming is in reasoning about a domain. If Copilot is good enough that it can solve problems in any domain, is it close enough to AGI that we can call it a day?
> relentless attempts at developer commodification

LOL, it's been happening since the beginning of software. So many things reduce or replace developer work - compilers, libraries, templates, free/open tools. Desire is always going to expand to contain the whole space of what's possible and then overflow.