by devinplatt 1806 days ago
The article is worth reading, but a good summary is at the bottom:

> This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.

But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.

The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.

This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.

3 comments

The answer wasn’t obvious to me. Nice solution.

It sounds like you’re a part of the Copilot team. If so, then I’m happy to see the Copilot team cares about these issues at all. I was expecting nothing but stonewall until the conversation died out, since realistically the chance of the EFF bringing or winning a lawsuit seems small. (And who else would try?)

But when you anger the world and bring so much attention to this delicate issue of copyright in AI, you risk every hobbyist. Suppose the world decides that AI models need to be restricted. Now every person who wants to get into AI will need to deal with it. I’m not sure anyone else cares, but I care, because it’s the difference between someone getting into woodworking (an unrestricted hobby) vs becoming a lawyer or doctor (the maximally restrictive hobby). The closer we are to the latter, the fewer ML practitioners we’ll see in the long run. And even though the world will go along fine — it always does — it’d be a sad outcome, since the only way it could happen is if gigantic corporations were flagrantly flying in the face of copyright spirit, daring it to punish them.

My point is, please care about the right things. No one cared about language filters on ML models outside of a select vocal group, yet look how deeply OpenAI took those concerns to heart. Everybody cares whether their personal or professional work is being ripped off by an overfitted AI model, and it wasn’t obvious that GitHub or OpenAI gave it more than a passing thought.

Backlinking to the training set should help. But it’s also going to catapult the concern of “holy moly, this code is GPL licensed!” to the front and center of anyone who works in corporate settings. Gamedev is particularly insular when it comes to GPL, and I can just imagine the conversations at various studios. “This thing might spit out GPL? We can’t use this.”

My point is, when you launch that new feature to address people’s concerns, please ensure it’s working. You won’t be able to do exact string matches against the training set; you can’t rely on “well, it’s slightly different, so it’s not really the same thing.” If it’s substantially similar, it needs to be cited. And that seems like a much tougher problem than merely building an index of matching code fragments.
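To make the gap between exact matching and "substantially similar" concrete, here is a minimal sketch of one common near-duplicate technique: token shingles compared with Jaccard similarity. Nothing here is taken from the article or from Copilot; the function names, the tiny keyword set, and the sample snippets are all hypothetical, and a real system would need a proper lexer and an index that scales to a whole training set.

```python
import re

# Assumption: a tiny, incomplete keyword set, just enough for the samples below.
C_KEYWORDS = {"int", "void", "return", "char", "if", "else", "for", "while", "struct"}

def tokens(code, normalize_ids=False):
    """Split code into identifiers, numbers, and punctuation, ignoring whitespace."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    if normalize_ids:
        # Replace every non-keyword identifier with a placeholder so renames do not matter.
        toks = ["ID" if re.match(r"[A-Za-z_]", t) and t not in C_KEYWORDS else t
                for t in toks]
    return toks

def shingles(code, k=3, normalize_ids=False):
    """Return the set of k-token windows (shingles) in the code."""
    toks = tokens(code, normalize_ids)
    if len(toks) < k:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a, b, k=3, normalize_ids=False):
    """Jaccard similarity of the two snippets' shingle sets, between 0.0 and 1.0."""
    sa = shingles(a, k, normalize_ids)
    sb = shingles(b, k, normalize_ids)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "int add(int a, int b) {\n\treturn a + b;\n}"
reindented = "int add(int a, int b) {\n    return a + b;\n}"
renamed = "int add(int x, int y) {\n    return x + y;\n}"
unrelated = "void swap(int *p, int *q) { int t = *p; *p = *q; *q = t; }"

print(original == reindented)                         # exact string match fails
print(jaccard(original, reindented))                  # whitespace is ignored by tokenizing
print(jaccard(original, renamed))                     # renames still hurt plain shingles
print(jaccard(original, renamed, normalize_ids=True)) # placeholder identifiers fix that
print(jaccard(original, unrelated, normalize_ids=True))
```

Even this toy version shows the trade-off: the more aggressively you normalize, the more rewrites you catch, but the more unrelated boilerplate starts to look like a match. Picking the threshold is the hard part.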

If you launch it, and it doesn’t work, it’s going to stoke the flames. Careful not to roast.

FYI, the post you replied to is entirely a quote from the article (even though formatting makes it appear that only the second paragraph is a quote). So the poster likely is not working on Copilot.

Haha. Thank you. Well, I just hope Copilot starts caring about people’s concerns.

It’s kind of strange that no one from Copilot has said anything on HN. I wonder who has the authority to discuss it, if anyone. Usually bad PR is accompanied by “So-and-so from X here! We hear you…”

They’re probably going to have a much harder time launching this feature than they imply in the article. It’s hard even for database companies that specialize in full text search.

The original thread about Copilot contained a post from the dev at GitHub. He only answered a couple of softball questions, though.

I think this article is an indication that GitHub is taking this seriously.

It's an indication that they want the appearance of taking it seriously.

I think this episode will tell us definitively how much of GitHub is left, and how much Microsoft has infected their culture.

Yep! I don't work there.

I can't fix it since I can't edit my comment anymore. Next time I'll be clearer with any multi-paragraph quotes!

QED, I guess. When you can't tell what's a quote and what's original, it's another manifestation of the same social problem.

> And even though the world will go along fine — it always does

It doesn't at all. I know a lot of people who have serious medical problems but are unable to afford the costs of the medical system and just suffer untreated. Perhaps in some countries there are good insurance plans, but in a lot of countries there are not.

I think the problem you highlight is about intellectual property, and not AI, nor even AI+intellectual property.

Woodworking is as restrictive as other hobbies. Just because the victims of your infringement are unlikely to find out about it doesn't mean you don't need to comply with their licenses.

> (And who else would try?)

Apache foundation?

I don't see your analogy here. You want ML to be unregulated like woodworking as opposed to medicine. That's actually not a good analogy at all, since even woodworking in service of people is quite regulated via building codes, you're just free to practice as a hobby.

Medicine and law are fields where being a hobbyist is much harder without using other people as guinea pigs rather than yourself, so they are regulated in any sane country. So it seems regulation follows fields that can have a significant impact on regular people's lives, and depends more on the nature of the field itself than on some bad actors. Of course bad actors are often what prompts the regulation, but it's not exactly a great argument to say we don't need speed limits if no one ever exceeded them.

Does AI need regulation? Probably. If Copilot accelerates that process then good. But it's also outrageous how much more this community cares about the non-violation of perceived open-source-code-freedom than all the other evil crap AI is used for (like in literal concentration camps). The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here. Presumably because you weren't the customer (or so you thought). When women were mailed material telling them they're pregnant even before they themselves knew but no one batted an eye here. But woe be someone who steals code you _already released for free_, now that's a step too far! Zuckerberg was right, y'all care more about the dead squirrel in your yard than a genocide across the world.

> The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here.

This is not true. It's trivial to find probably two orders of magnitude more complaints about those larger companies on this site than the two or three days of GitHub bashing. It's also trivial to find hundreds of posts in threads saying "Everybody starts screaming when [Google/Apple/Facebook/the Government] does something like this, but people just let [Google/Apple/Facebook/the Government] get away with doing far worse."

When you do this, what you're doing is trying to shut down debate.

> When women were mailed material telling them they're pregnant even before they themselves knew but no one batted an eye here

This is an example often quoted. This wasn’t AI. I believe it was Target and it was simple correlation with search keywords.

Moreover, the issue was her parents finding out before she herself told them. The result would have been the same if she had set up a baby registry at Target and the company assumed the news wasn’t a secret.

https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...

As a nitpick, it is AI, it's just a bit different from the current crop of what most people think of as AI. Expert systems and evolutionary programming are also on the same spectrum as ML, as steps along the path to what we think true AI will be (even if not necessarily a straight path).

Noticing statistical correlations and acting on them automatically in a recommendation engine is definitely a type of AI.

Suppose you are a teacher for a group of programming students. You ask them to implement a binary tree in C. The first student comes up with a verbatim copy of the code on Rosetta Code, the second with the same code with some variable names changed, and the third with indentation changed from tabs to spaces. Which of them were cheating?

I think this hypothetical might miss the point of the comment it's replying to. In my mind, cheating requires intention. Now, it would be hard for the student to prove that any of these situations were not intentional, but if, for example, the student studied Rosetta Code for the exam, it's very possible the student just memorized the code and didn't reference it during the test. Is that still cheating?

Cheating does not require intent. That is, the university only has to prove that the student's work is not original to conclude that plagiarism and also cheating has occurred. Memorization is usually not prohibited in exams.

"I know when I’m quoting."

or so you say <g>