by sillysaurusx 1810 days ago
The answer wasn’t obvious to me. Nice solution.

It sounds like you’re a part of the Copilot team. If so, then I’m happy to see the Copilot team cares about these issues at all. I was expecting nothing but stonewall until the conversation died out, since realistically the chance of the EFF bringing or winning a lawsuit seems small. (And who else would try?)

But when you anger the world and bring so much attention to this delicate issue of copyright in AI, you put every hobbyist at risk. Suppose the world decides that AI models need to be restricted. Now every person who wants to get into AI will need to deal with it. I’m not sure anyone else cares, but I care, because it’s the difference between someone getting into woodworking (an unrestricted hobby) vs becoming a lawyer or doctor (the maximally restrictive hobby). The closer we are to the latter, the fewer ML practitioners we’ll see in the long run. And even though the world will go along fine, as it always does, it’d be a sad outcome, since the only way it could happen is if gigantic corporations were flagrantly flying in the face of the spirit of copyright, daring it to punish them.

My point is, please care about the right things. No one cared about language filters on ML models outside of a select vocal group, yet look how deeply OpenAI took those concerns to heart. Everybody cares whether their personal or professional work is being ripped off by an overfitted AI model, and it wasn’t obvious that GitHub or OpenAI gave it more than a passing thought.

Backlinking to the training set should help. But it’s also going to catapult the concern of “holy moly, this code is GPL licensed!” to the front and center of anyone who works in corporate settings. Gamedev is particularly insular when it comes to GPL, and I can just imagine the conversations at various studios. “This thing might spit out GPL? We can’t use this.”

My point is, when you launch that new feature to address people’s concerns, please ensure it’s working. You won’t be able to do exact string matches against the training set; you can’t rely on “well, it’s slightly different, so it’s not really the same thing.” If it’s substantially similar, it needs to be cited. And that seems like a much tougher problem than merely building an index of matching code fragments.
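
To make the "substantially similar" point concrete, here is a minimal sketch of one way near-duplicate code could be scored. It is not anything GitHub or OpenAI is known to use; the function names (`tokenize`, `normalize`, `shingles`, `similarity`) and the choice of 4-token shingles with Jaccard similarity are assumptions made for illustration only.

```python
import keyword
import re


def tokenize(code: str) -> list[str]:
    # Split into identifiers, integer literals, and single punctuation
    # characters. Whitespace is discarded, so reformatting has no effect.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code)


def normalize(tokens: list[str]) -> list[str]:
    # Replace every non-keyword identifier with a placeholder so that
    # renaming variables does not hide a copy (an illustrative heuristic).
    out = []
    for t in tokens:
        is_identifier = t[0].isalpha() or t[0] == "_"
        out.append("ID" if is_identifier and not keyword.iskeyword(t) else t)
    return out


def shingles(tokens: list[str], k: int = 4) -> set[tuple[str, ...]]:
    # All overlapping runs of k tokens; at least one (possibly short) run.
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def similarity(a: str, b: str, k: int = 4) -> float:
    # Jaccard similarity of the two shingle sets, between 0.0 and 1.0.
    sa = shingles(normalize(tokenize(a)), k)
    sb = shingles(normalize(tokenize(b)), k)
    return len(sa & sb) / len(sa | sb)


original = "def add(a, b):\n    return a + b\n"
renamed = "def add(x,y):\n    return x+y\n"
unrelated = "print('hello')\n"

print(original == renamed)              # exact string match fails
print(similarity(original, renamed))    # token-level comparison still matches
print(similarity(original, unrelated))
```

An exact string index would miss the renamed copy, while this comparison treats it as identical. Even so, it only catches renaming and reformatting. Reordered statements or restructured logic would need something stronger, and comparing every suggestion against an entire training set would need an approximate index rather than pairwise comparison.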

If you launch it, and it doesn’t work, it’s going to stoke the flames. Careful not to roast.

5 comments

FYI, the post you replied to is entirely a quote from the article (even though formatting makes it appear that only the second paragraph is a quote). So the poster likely is not working on copilot.
Haha. Thank you. Well, I just hope Copilot starts caring about people’s concerns.

It’s kind of strange that no one from Copilot has said anything on HN. I wonder who has the authority to discuss it, if anyone. Usually bad PR is accompanied by “So-and-so from X here! We hear you…”

They’re probably going to have a much harder time launching this feature than they imply in the article. It’s hard even for database companies that specialize in full text search.

The original thread about Copilot contained a post from the dev at GitHub. He only answered a couple of softball questions, though.
I think this article is an indication that GitHub is taking this seriously.
It's an indication that they want the appearance of taking it seriously.

I think this episode will tell us definitively how much of GitHub is left, and how much Microsoft has infected their culture.

Yep! I don't work there.

I can't fix it since I can't edit my comment anymore. Next time I'll be more clear with any multi paragraph quotes!

QED, I guess. When you can't tell what's a quote and what's original, it's another manifestation of the same social problem.
> And even though the world will go along fine — it always does

It doesn't at all. I know a lot of people who have serious medical problems but are unable to afford the costs of the medical system and just suffer untreated. Perhaps in some countries there are good insurance plans, but in a lot of countries there are not.

I think the problem you highlight is about intellectual property, and not AI, nor even AI+intellectual property.

Woodworking is as restrictive as other hobbies. Just because the victims of your infringement are unlikely to find out about it doesn't mean you don't need to comply with their licenses.

> (And who else would try?)

Apache foundation?

I don't see your analogy here. You want ML to be unregulated like woodworking as opposed to medicine. That's actually not a good analogy at all, since even woodworking in service of other people is quite regulated via building codes. You're just free to practice it as a hobby.

Medicine and law are fields that are much harder to practice as a hobby, because you would be using other people as guinea pigs rather than yourself, so they are regulated in any sane country. So it seems regulation follows fields that can have a significant impact on regular people's lives, and depends more on the nature of the field itself than on some bad actors. Of course bad actors are often what prompts the regulation, but it's not exactly a great argument to say we wouldn't need speed limits if no one ever exceeded them.

Does AI need regulation? Probably. If Copilot accelerates that process then good. But it's also outrageous how much more this community cares about the non-violation of perceived open-source-code-freedom than all the other evil crap AI is used for (like in literal concentration camps). The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here. Presumably because you weren't the customer (or so you thought). When women were mailed material telling them they're pregnant even before they themselves knew but no one batted an eye here. But woe be someone who steals code you _already released for free_; now that's a step too far! Zuckerberg was right, y'all care more about the dead squirrel in your yard than a genocide across the world.

> The amount of shit GitHub is getting over this is magnitudes larger than for all the evil stuff Google and Facebook have done with AI. But no comment from here.

This is not true. It's trivial to find probably two orders of magnitude more complaints about those larger companies on this site than the two or three days of Github bashing. It's also trivial to find hundreds of posts in threads saying "Everybody starts screaming when [Google/Apple/Facebook/the Government] does something like this, but people just let [Google/Apple/Facebook/the Government] get away with doing far worse."

When you do this, what you're doing is trying to shut down debate.

> When women were mailed material telling them they're pregnant even before they themselves knew but no one batted an eye here

This is an often-quoted example. This wasn’t AI. I believe it was Target, and it was simple correlation with search keywords.

Moreover, the issue was her parents finding out before she herself told them. The result would have been the same if she had set up a baby registry at Target and the company assumed the news wasn’t a secret.

https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...

As a nitpick, it is AI, it's just a bit different from the current crop of what most people think of as AI. Expert systems and evolutionary programming are also on the same spectrum as ML, as steps along the path to what we think true AI will be (even if not necessarily a straight path).

Noticing statistical correlations and acting on them automatically in a recommendation engine is definitely a type of AI.