
I’m not saying it is; the courts should really decide what exactly counts as fair use in this case.

All I’m saying is that you don’t need to be a huge corporation to do it, and that others are doing similar things as well.

I passed on Snyk Code due to similar concerns, especially since they pull examples directly from FOSS projects and even had a “fix me” option that pushes pull requests with fixes into your repo.

On ML in general, the current policy I’m working on for my org is that we do not use any pre-trained models trained on public data, and I’ve pushed the legal team to actually start figuring out how we should deal with these issues properly in the future.

ML is currently a Wild West; it’s too new to have been tested and defended in court, regardless of how the chips would or should fall.

As for your specific example, it would really depend on what data is actually preserved.

Since they do parrot whole code snippets, comments and all, it seems that they don’t have a generalized model, at least not for every problem.

However, it’s also my personal legal opinion (IANAL) that if you can prove that the model holds nothing but a generalized solution for a given problem, the code it outputs isn’t a derivative work any more than the code written by a person who learned from copyleft code.

Then there is the whole issue of “allowed use”: none of the existing licenses specify whether the code can be used to train a model. This also means that we probably need to update all existing licenses to include a clause that explicitly states the limitations for this use case.

For code under existing licenses, the fair use question needs a proper judgment from a court.

My gut feeling is that it would count as fair use, just as using code in a course or a book would. For that to happen, though, GitHub definitely needs to make a page with attributions and make sure their model doesn’t output anything but a generalized solution.