by ineedasername 1816 days ago
Are they actually using the source to make a derivative/fork though? If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area. Though I would still agree that the right thing to do would be to have an attribution area, even if it was thousands of entries long. Whether technically required by the license or not, the spirit of these licenses I think would come down on pushing for attribution regardless of the nature of the re-use.
3 comments

> If reusing the code in another codebase then definitely attribution would be required. But using it as a dataset seems a bit different-- a grey area.

It's already been demonstrated that tools in the GPT family, Copilot included, frequently output large chunks of their training dataset verbatim. It's not hard to trigger this behavior, even unintentionally. To me, this is much closer to "reusing". But I'm not a lawyer.

It's also worth remembering that there are two parties potentially open to liability here: GitHub, for the way the code was used with Copilot, and the user, who may be unwittingly including licensed code in their codebase. Given the well-known behavior of the GPT family I mentioned above, it might be hard to argue that Copilot "just chanced" into generating code that's identical to existing, non-public-domain code.

> frequently output large chunks of their training dataset verbatim

Ah, that then is extremely problematic.

I really like the idea of Copilot to speed development-- basically code completion taken to an extreme-- but this seems like a very bad way to go about it.

Don't forget the usage restrictions, as specified in each individual license.
And of course the GPL would attach where applicable.