Hacker News
by spupe 1459 days ago
If you assigned a task to a junior dev, and he/she used some code from open source projects and Stack Overflow to develop a custom program for the task, would you say that this person is selling you other people's code? Is it common or expected for this type of use to be acknowledged?

People I've worked with have different philosophies on this, but personally, if you check in code that is distinctive enough that I can identify the source you copied and pasted it from, and you provided no indication (whether in a comment or a PR description) that you copied it, I will really get quite grumpy at you about it.

Way too often I burn half an hour needlessly during review in one of two ways:

* trying to figure out how the heck someone figured out some "magic" code that achieves something by invoking a bunch of poorly documented library or framework internals, and trying to reverse engineer WTF all the magic does by diving into the framework's source... only to eventually think to google the whole snippet rather than each individual method call, and discover it's copied from a Stack Overflow answer

* trying to figure out why something was written in an unidiomatic or overcomplicated way rather than a more obvious approach, and commenting at length on how I'd simplify it... only to eventually realise it was copied from a Stack Overflow answer

Attribution isn't just about making sure the right person gets credit, or about license compliance; reviewers and maintainers frequently need to be able to see where stuff was copied and pasted from in order to do their jobs effectively, even for snippets of just a few lines.

I understand where you are coming from. However, I think you are assuming that this person simply copied and pasted some code with no understanding of it, or that the code is very different from your codebase and needs to be refactored. If using Stack Overflow reduced your overall development time rather than adding to it, because the snippet was used as an appropriate piece of a much bigger puzzle (a far more realistic scenario for both Copilot and our general use of SO), then I see no issue with it whatsoever. Certainly no moral or copyright issues, as this person on Twitter implies.
No copyright issues in the sense that no entity is likely to ever pursue the matter, sure. But copying and commercially using someone else's nontrivial bit of code that doesn't have a license that says you can is quite blatantly a copyright violation.
About 10 years ago or so, I was working at a certain place. They put me into a small team apparently focused on some R+D project under the direction of an "architect".

Basically, the project was to package Cordova + Backbone + Marionette, plus a couple of tools, under their own commercial name. Then they'd go around potential clients presenting it as the perfect solution to build hybrid applications for web/mobile/smartTV/whatever.

A certain Monday, the "architect" arrived boasting. He did that often, but this time he was more boastful. He explained that he had spent the whole weekend coding. He had written an incredible tool that would create a skeleton for a project from zero. You would type something like `tool create` and it would create the whole project with all the scripts and some example views and whatnot.

It was Yeoman's yo CLI tool, of course. He had just changed the copyright notice in the comments, removed most of the other comments, deleted any mention of Yeoman or the original creators, and changed the name of the executable script. That's it.

The whole thing was OS code picked up from various repos and packaged as their own. The company used it to sell development projects. The so-called-architect used it to sell himself inside the company and then jump away into a startup as CTO.

Is this common or is it just anecdata? I don't know. It's clearly not the only time I've seen something like this and I do know that in certain companies around here it isn't exactly uncommon. But I can't say how common or uncommon it is.

Would I call this "selling other people's code"? Yes, I would.

This is clear-cut fraud, but it is also not even close to what Copilot or most junior devs are doing.
If the solution was made up of ideas from OSS and snippets from Stack Overflow? No; that's fine.

If the solution was copied from an OSS project without proper attribution? Yes. Absolutely. And they'd have words with a senior dev and maybe even legal if the code they copied made its way into production without attribution.

Many copyleft OSS licenses require attribution, and require derivative works to be distributed under terms that we wouldn't allow.

It depends on the source of that code and the expected license of the code you paid them for. If everything is MIT/BSD (and attributed), no problem. If the code was GPL and I’m making a commercial product, we have an issue.

I’d also expect any Stack Overflow code to include a comment with a link to the Stack Overflow page.
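Roughly what I have in mind, as a sketch only: the link, the username, and the `chunked` helper are all placeholders I made up for illustration, not a real answer. Stack Overflow contributions are licensed under CC BY-SA, and the exact version depends on when the answer was posted, so that is worth noting too.

```python
# Adapted from a Stack Overflow answer (placeholder link, not a real answer):
#   https://stackoverflow.com/a/<answer-id>
# Answer author: <username>; licensed CC BY-SA (check the version shown on the answer).
# Changes from the original: renamed variables, added the size check.
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

A reviewer who sees that header knows where to look before reverse engineering anything, and a maintainer can tell which parts were changed locally.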

I think one of the key points is to make sure any code taken from another source is cited appropriately. If it isn’t, or the junior dev is passing it off as their own work, then we have problems.

If I found out a junior dev had been copying copyleft or proprietary code, then I'd have to rip out that code, have a chat with them and figure out what to do from there. Even if the code isn't copyleft, it's still someone else's code; sometimes that's OK, but sometimes it's definitely not.
No matter how complex a program is, and no matter whether it uses techniques sometimes described as "AI" in its implementation, it's not a person. Copilot is just a very complex pipeline from other people's code to your editor, which ignores the license of those other people's code.
This is a good thought exercise. I wouldn't call it stealing, though I am not sure how legal liability is assessed, say if they picked up GPL code unknown to the company, and the company is later sued over it.

This isn't derived from principled reasoning, but I think of it as similar to community norms. Not the best example, but you wouldn't mind someone subletting their home on Airbnb; if your whole apartment complex does it, though, it invites regulation. A product like Copilot enables copying code (even if inspired, and not verbatim) at a scale that individual developers can't match. So respect for software licenses needs to be codified (legally?), whereas previously it was left unmonitored.

It's absolutely fine to allow humans to do that while prohibiting (commercialized) AI from doing the same thing.
I don't see why that should be the case in this particular scenario, or what benefit is gained from that. Could you elaborate?
Could you elaborate on why you think a computer program and a person should be treated the same way in this respect?

We can take as self-evident that a human is capable of reading about something, conceptualising it, and then writing something completely new with the knowledge they have gained.

I think it's also pretty uncontroversial that the primitive "AI" we currently have is nowhere near the level of even an average human at these things, and thus we can't just blindly assume it is conceptualising rather than copying. Copilot regularly produces verbatim copies of existing code when working on non-trivial things.

Forget about the "AI" label: Copilot is just a complex computer program, that takes code from other people and inserts various permutations of it into your editor, whilst ignoring the license of that code.

I think it's best if we sidestep these big conceptual questions about what cognition or creativity really are. It's hard to find agreement, and perhaps it is not necessary to do so.

My position is that if a person hired in a company can currently use Google, Stack Overflow and GitHub to help develop their custom scripts, and no moral or copyright issues are infringed (ie, you don't try to say you came up with it on your own, and you use only enough that it is clearly fair use), then I think an AI should be able to assist in that task. There is no need to complicate things by legislating what the AI is doing and what Google is doing, as they are very similar things and in fact even use similar methods.

I would agree with you if the AI was genuinely assisting with that task, but it isn't.

It's taking inputs, ignoring their licenses, permuting them in ways that are not understandable to the user, and then outputting them.

That's an entirely different task than the user reading SO or using Google and then writing their own code, because the "AI" is not capable of writing its own code at that level.

Relying on this tool means ignoring the license of code that you're copying, without even knowing that you're doing it.

> That's an entirely different task than the user reading SO or using Google and then writing their own code, because the "AI" is not capable of writing its own code at that level.

I would say it's a very similar task. If I need to remember how to use a certain function, I can Google for documentation and examples, or I can tell Copilot what I want to do. Whether the solution was presented by Copilot or by an SO thread is, in my view, irrelevant. On top of that, I doubt anyone checking SO truly knows where that answer came from. The person could simply be reproducing a snippet from somebody else; you have no way of knowing whether it was licensed.

I don't think this is bad either. Even our current shitty copyright laws protect that kind of use. I shouldn't have to worry whether my little prime number generator uses an algorithm first created by John Carmack or Microsoft. Programming has evolved rapidly in great part because we can all use other people's work and use it to improve ours. Of course you shouldn't just copy and paste everything and call it a day, but that's hardly what Copilot enables anyway.

If I make a script and train it on Windows source code, do you think MS will like it if I use that script on Wine? I am sure MS will say that the license did not allow it and that the script's transformations are not original. So the GPL or similar licenses should be respected by Microsoft too.

>My position is that if a person hired in a company can currently use Google, Stack Overflow and GitHub to help develop their custom scripts, and no moral or copyright issues are infringed (ie, you don't try to say you came up with it on your own, and you use only enough that it is clearly fair use),

Only a judge can determine whether it is actually fair use. If you by chance copied some super clever and unique code into your code base, then I am sure a judge would not call it fair use. Copilot has been shown to do this (though MS said they put some IF-ELSE checks in the AI to keep the plagiarism from being detected, by removing obvious results and maybe obfuscating stuff more).

Maybe the Stack Overflow license allows you to copy and paste answers into your code, but code on GitHub has repo-specific licenses that you need to respect.

If MS trained the model on all their private repos too and made the model free software, then many would not have these issues. Or they could keep the model proprietary and train it only on MS repos and on BSD and similarly licensed repos.

You are saying that the AI should be treated the same way as a person would regarding its 'output'. I disagree. This is a conceptual disagreement, and you cannot just sweep "what cognition or creativity really are" under the rug.

In the end, when in several (2-5) years we start seeing structural unemployment emerging because of AI deployments, this will be resolved by the legal system, most likely by some sort of partial prohibition of training/monetizing such systems.

I think I still have not understood your argument. Are you saying that you are afraid that AIs will become too powerful and cause unemployment, and therefore we should regulate them now before they do so?

Many people are worried about this, which is why there is a lot of debate about minimum income programs. However, at present, what Copilot is doing is similar to what Google does, and it is certainly not going to replace devs any time soon. Personally, I think we should exploit technology to its fullest, and the only reason we can have this conversation is that in the past, we didn't give too much consideration to the mailmen, secretaries, delivery workers and everyone else who got displaced by our use of the internet and similar technologies. We merely adapted to better exploit them.

Copilot understands concepts as well as many humans. You can see primitive versions of this in the old Word2Vec demos showing how those models understand that London:England ~= Paris:France.

Copilot is much more sophisticated than that, and it no more copies code than a human does. It generates on a character by character basis given the contextual probability of the next character conditioned on the previous set of tokens, with the "heat" being a factor in how randomly it will choose characters.

This is much more similar to how a human writes than "copying".
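To make the sampling description concrete, here is a minimal sketch of temperature-scaled sampling, which is what the "heat" usually goes by. It is not Copilot's actual code: the three-item vocabulary and the scores are made up, and real models of this kind generally pick from a large vocabulary of tokens rather than single characters.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(vocab, logits, temperature, rng):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical candidates and scores, for illustration only.
vocab = ["return", "print", "raise"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.2)  # nearly always the top choice
hot = softmax_with_temperature(logits, 5.0)   # close to uniform
choice = sample_next(vocab, logits, 1.0, random.Random(0))
```

At a low temperature the top-scoring candidate takes almost all of the probability mass, so the output is close to deterministic; at a high temperature the three candidates become nearly equally likely.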

"it no more copies code than a human does" < that's a very big call right there, considering how much verbatim copying has already been documented in Copilot. The primitive understanding Copilot has of what it is generating doesn't even approach that of an average programmer. It's classic AI: impressive on the surface.
This isn't true.

All the "copied code" I've seen is where the person prompts it with a large amount of very unique preamble and then it fills in the exact example they are quoting from.

Try it without doing that.

And it's weird people think it can't understand conceptual relationships. Word2Vec demonstrated that nearly 10 years ago and that's a much weaker model in terms of both size and techniques than this is.
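For anyone who hasn't seen the Word2Vec demo, the analogy trick is just vector arithmetic followed by a nearest-neighbour lookup. A toy sketch, with the loud caveat that these four-dimensional vectors are hand-written by me so that the arithmetic works out; real Word2Vec embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors: [is_capital, england, france, fruit].
# NOT real Word2Vec embeddings; chosen so the example is easy to follow.
vectors = {
    "london":  [1.0, 1.0, 0.0, 0.0],
    "england": [0.0, 1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0, 0.0],
    "france":  [0.0, 0.0, 1.0, 0.0],
    "banana":  [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as ? is to c' via a - b + c, skipping the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy("paris", "france", "england")
```

Here `answer` is the word whose vector is closest to paris - france + england. Whether that counts as "understanding" is exactly what this thread is arguing about; the sketch only shows the mechanism.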