Hacker News
by ninkendo 1310 days ago
> Microsoft is explicitly saying it's your responsibility to check if the Copilot output that you add to your codebase is infringing on anyone's license.

(Never used copilot)

Wow, this is kinda shocking IMO. It kind of negates the entire value proposition of the tool.

How am I supposed to find out whether a snippet is infringing? Should I paste it into Google or something? Shouldn’t Copilot be the one to tell me if a snippet too closely matches some existing code it learned from?

If MS is indeed saying this, I feel like it’s something they put in the agreement to cover their own asses. There’s no way they’d really expect everyone to do this sort of thing. Moreover I don’t feel that’s a very strong defense MS could use in court if somebody decides to go after MS for making the tool that makes infringement so easy. It sounds like one of those “wink wink” types of clauses that they know full well nobody will follow.


From the official FAQ [0]:

> Other than the filter, what other measures can I take to assess code suggested by GitHub Copilot?

> You should take the same precautions as you would with any code you write that uses material you did not independently originate. These include rigorous testing, IP scanning [emphasis mine], and checking for security vulnerabilities. You should make sure your IDE or editor does not automatically compile or run generated code before you review it.

I think lots of companies do run tools such as BlackDuck and others to scan their entire code base and ensure (or at least have some ass-covering) that there is no accidental copyright infringement.

[0] https://github.com/features/copilot#other-than-the-filter-wh...

How much of what you save by using Copilot will then be spent on BlackDuck licenses?

While the cost to programmers' sanity of running things like BD is immeasurable in my estimation, if you are already doing it, doing it for Copilot code shouldn't add any extra cost, unless Copilot is actually constantly spewing copyrighted code.

> While the cost to programmers' sanity of running things like BD is immeasurable in my estimation

Can you clarify? In my experience, a source scan is just another job in one's build pipeline. And I've only seen it fail when it does, in fact, detect a new component (or a license change in an existing component), because at that point you have to do the legal dance for third-party notices etc. But that latter part is something you have to do either way, tools or no tools.

Source scan is indeed not a problem. Scanning all the binary blobs is where things go wrong, in two ways.

First, there are quite a few false positives, especially if you use commercial 3rd parties as well. For example, I had a UI component recognized as some obscure academic micro kernel!? Investigating, we found that happened because that micro kernel project was using the same commercial UI component somewhere (probably under some academic license), and their repo was just where BD had seen this JS code before.

Second, and much more common and annoying, at least in BD in my company, you have to add an explanation to each individual identified 3rd party package that uses something like GPL, affirming that it is being used in a way that complies with the license. If you're doing something like distributing a Linux VM, that means hundreds of packages that are part of the distribution. This work has to be done manually, which means entering the same copy/paste text in hundreds of places in the atrociously slow BD UI.
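
To make the first kind of false positive concrete, here is a toy sketch of how it can happen. This is not how BD actually works (I don't know its internals); the `ToyIndex` class, the `fingerprint` function, and the project name are all made up for illustration. The assumption is simply that a scanner credits a file to whichever project it first saw the same content in.

```python
import hashlib
from typing import Dict, List, Optional


def fingerprint(content: str) -> str:
    """Hash file content after stripping per-line whitespace and blank lines."""
    normalized = "\n".join(
        line.strip() for line in content.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ToyIndex:
    """Hypothetical scanner index: fingerprint -> first project it was seen in."""

    def __init__(self) -> None:
        self._first_seen: Dict[str, str] = {}

    def crawl(self, project: str, files: List[str]) -> None:
        # setdefault keeps the earliest attribution and ignores later ones.
        for content in files:
            self._first_seen.setdefault(fingerprint(content), project)

    def identify(self, content: str) -> Optional[str]:
        return self._first_seen.get(fingerprint(content))


# A commercial UI component that the scanner never crawled from its vendor.
ui_component = "function render(widget) {\n  return widget.draw();\n}\n"

index = ToyIndex()
# The only public repo containing it is an unrelated project that bundles it.
index.crawl("academic-microkernel", ["void kmain(void) {}\n", ui_component])

# Our product ships the same component, reindented by our build.
ours = "function render(widget) {\n    return widget.draw();\n}\n"
match = index.identify(ours)
print(match)  # academic-microkernel
```

The match is technically right (the content is identical) but the attribution is wrong, which is why a human still has to investigate each hit.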

Capex vs opex, huge difference