by dogma1138 1816 days ago
Snyk did the same with Snyk code to build their “ML driven SAST” offering.

Pretty much anyone can scrape GitHub and train their model.

What exactly the legal implications of this are has yet to be tested.

Pretty much every model is susceptible to some sort of model inversion or set inclusion attack.

By their own admission Co-Pilot sometimes outputs PII that is part of the code, as well as code snippets verbatim. Even if it’s rare (iirc around 0.1%), it’s still a huge legal liability for anyone who uses the tool, especially since it’s unclear how these inclusions are spread out and what triggers them. For example, it could be that a particular coding style or way of using Co-Pilot, or working on a specific subset of problems, increases the likelihood of this occurring.
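
To put the rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the recalled ~0.1% figure applies per suggestion and that suggestions are independent with a uniform rate. Both are assumptions, not published facts, and the second is exactly what is unclear above. The class and method names are made up for illustration.

```java
// Hypothetical sketch: chance of seeing at least one verbatim output
// over n suggestions, under the stated (unverified) assumptions.
class VerbatimOdds {
    // P(at least one) = 1 - (1 - p)^n, assuming each suggestion is an
    // independent trial with the same per-suggestion rate p.
    static double atLeastOne(double p, int n) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        double p = 0.001; // the ~0.1% figure recalled above, taken as per-suggestion
        for (int n : new int[] {100, 1000, 10000}) {
            System.out.printf("%d suggestions -> %.3f%n", n, atLeastOne(p, n));
        }
    }
}
```

If the inclusions cluster around certain coding styles or problem domains, the real exposure for a given team could be much higher or lower than this uniform estimate.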

ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and receive a GDPR deletion request, do you need to throw away and retrain your model?

I don’t think people should be angry; however, I also think that this needs to be tested in court, and multiple times, before it can be “safe to use”.

But I also don’t think that the ML model is necessarily a derivative work.

For example, if you use copyleft material to construct a CS course, someone would be hard pressed to argue that the course now needs to be released freely, let alone that anything the students write after attending the course would count as a derivative work too.

1 comments

if I feed the entirety of github (sans licenses) into a java HashMap and provide an interface to query it, I very much doubt that its output would qualify as "fair use"

why is it different if a slightly more complicated data structure is used?
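
A minimal sketch of that thought experiment, for concreteness. The class name, method names, and sample snippet are all hypothetical; the point is only that a plain lookup table can do nothing except hand back the stored text unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration: a "model" that is just a HashMap from prompt to source text.
class SnippetStore {
    private final Map<String, String> snippets = new HashMap<>();

    // "Training" here is nothing more than storing the original text.
    void ingest(String prompt, String sourceCode) {
        snippets.put(prompt, sourceCode);
    }

    // A query returns the stored text verbatim, or nothing for an unseen prompt.
    Optional<String> query(String prompt) {
        return Optional.ofNullable(snippets.get(prompt));
    }

    public static void main(String[] args) {
        SnippetStore store = new SnippetStore();
        // Made-up snippet standing in for scraped code, comments and all.
        String original = "// Copyright (c) Example Author\nint add(int a, int b) { return a + b; }";
        store.ingest("add two ints", original);
        System.out.println(store.query("add two ints").orElse("<no match>"));
        System.out.println(store.query("subtract two ints").orElse("<no match>"));
    }
}
```

Every answer here is a byte-for-byte copy and there is no generalization to unseen prompts, which is the extreme end of the spectrum the question is poking at.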

I’m not saying it is; the courts should really decide what exactly counts as fair use in this case.

All I’m saying is that you don’t need to be a huge corporation to do it, and that others are doing similar things as well.

I passed on Snyk code due to similar concerns, especially since they pull examples directly from FOSS projects and even had a “fix me” option where they push pull requests with fixes into your repo.

On ML in general, the current policy I’m working on for my org is that we do not use any pre-trained models trained on public data, and we have pushed the legal team to actually start figuring out how we should deal with these issues properly in the future.

ML is currently a Wild West; it’s too new to have been tested and defended in court, regardless of how the chips would or should fall.

As far as your specific example it would really depend on what data is actually preserved.

Since they do parrot whole code snippets, comments and all, it seems that they don’t have a generalized model, at least not for every problem.

However, it’s also my personal legal opinion (IANAL) that if you can prove that the model holds nothing but a generalized solution for a given problem, the code it outputs isn’t a derivative work any more than the code of a person who learned from copyleft code.

However, then there is the whole issue of “allowed use”. None of the existing licenses specify whether the code can be used to train a model, which also means that we probably need to update all existing licenses to include a clause that explicitly states the limitations for this use case.

For code under existing licenses, the fair use question needs a proper judgement.

My gut feeling is that it would count as fair use, just as using code in a course or a book would. GitHub definitely needs to make a page with attributions tho for that to happen, and make sure their model doesn’t output anything but a generalized solution.