Hacker News
by CoastalCoder 598 days ago
I'm curious:

Suppose I upload some code to GitHub, but I didn't have the authority to share it with anyone at all. And then it was used to train DL models.

How would various jurisdictions handle that? Would any of them force the deletion of all resulting model weights?

And how might the remedies differ based on the kind of data? E.g., copyright vs. trade secret vs. protected medical info vs. military secrets?

1 comments

Not a lawyer, but I would expect you to be the one on the hook there as you shared code without permission and likely lied when agreeing to GitHub's T&Cs.

Microsoft wouldn't be able to pull that code out of already trained models and, given that MS didn't do anything illegal when they used code that you said was yours to share, I wouldn't expect them to be liable at all. That means MS likely wouldn't be fined, nor would they have to eat the costs of removing the models entirely.

If it were that easy to ruin anyone's model after it was trained, no one would be able to make one at all. The training sets used to date almost certainly contain legally questionable content, and anyone interested in stopping GitHub (for example) would just pepper repos with content that violates licenses.

> Microsoft wouldn't be able to pull that code out of already trained

I imagine they could; they just wouldn't want to, because it might require retraining the model from scratch, or at least from some not-very-recent checkpoint.

Yeah that was actually what I was trying to get at. Microsoft would have to get rid of the model trained on protected content, remove that content from the training set, and start over training a new model.