| The answer wasn’t obvious to me. Nice solution. It sounds like you’re part of the Copilot team. If so, then I’m happy to see the Copilot team cares about these issues at all. I was expecting nothing but stonewalling until the conversation died out, since realistically the chance of the EFF bringing or winning a lawsuit seems small. (And who else would try?) But when you anger the world and bring so much attention to this delicate issue of copyright in AI, you put every hobbyist at risk. Suppose the world decides that AI models need to be restricted. Now every person who wants to get into AI will need to deal with it. I’m not sure anyone else cares, but I care, because it’s the difference between someone getting into woodworking (an unrestricted hobby) vs becoming a lawyer or doctor (the maximally restrictive hobby). The closer we are to the latter, the fewer ML practitioners we’ll see in the long run. And even though the world will go along fine, as it always does, it’d be a sad outcome, since the only way it could happen is if gigantic corporations were flagrantly flying in the face of the spirit of copyright, daring it to punish them. My point is, please care about the right things. No one cared about language filters on ML models outside of a select vocal group, yet look how deeply OpenAI took those concerns to heart. Everybody cares whether their personal or professional work is being ripped off by an overfitted AI model, and it wasn’t obvious that GitHub or OpenAI gave it more than a passing thought. Backlinking to the training set should help. But it’s also going to catapult the concern of “holy moly, this code is GPL licensed!” front and center for anyone who works in a corporate setting. Gamedev is particularly insular when it comes to GPL, and I can just imagine the conversations at various studios. “This thing might spit out GPL? We can’t use this.” My point is, when you launch that new feature to address people’s concerns, please ensure it’s working. 
You won’t be able to rely on exact string matches against the training set; you can’t fall back on “well, it’s slightly different, so it’s not really the same thing.” If it’s substantially similar, it needs to be cited. And that seems like a much tougher problem than merely building an index of matching code fragments. If you launch it and it doesn’t work, it’s going to stoke the flames. Careful not to get roasted. |