by lindenksv85 1799 days ago
Sort of. DMCA protects service providers against copyright infringement claims related to stuff uploaded to their services by third parties. So long as they adhere to DMCA requests, they’re not violating copyright law themselves.

> Sort of. DMCA protects service providers against copyright infringement claims related to stuff uploaded to their services by third parties. So long as they adhere to DMCA requests, they’re not violating copyright law themselves.

This is probably an extremely stupid question as I'm neither a lawyer nor an ML dev (merely a humble backend developer), but let's say that the above situation applies and that GitHub has taken down Bob's repo as per Alice's DMCA request. However, let's say that in between Bob uploading the offending code and Alice submitting the DMCA request, GitHub used Bob's repo as part of a training set for Copilot. Now that they've complied with the takedown request, does GitHub have to restore Copilot to an earlier state that hadn't yet been trained on Bob's repo? Does this question even make sense since I only know the absolute barest bones of ML?

Also not a lawyer, but I've been around ML for a while. The question makes perfect sense to me!

It takes some amount of time to comply with a takedown notice. For example, time passes between receiving Alice's notice and taking down Bob's repo.

I would expect Copilot's model(s) to be retrained periodically in order to remain relevant. The next retraining could exclude Alice's code. That might be a longer window than the case of the repo takedown, but as long as it doesn't take too long they might be okay?

There are incremental training approaches that evolve models over time rather than completely retraining them. In my experience, complete retraining is a far more common approach because the highly path-dependent nature of incremental training can lead to outcomes that are hard to manage. For example, what if you discover bad training data like repos that collect anti-patterns? Or Alice's takedown notice? You typically want your models to be able to "unsee" things, and that's hard with purely incremental training. Even when incremental approaches are used, there is often an occasional complete retraining to overcome such issues.
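
To make the "unsee" point concrete, here is a toy sketch in Python. Everything in it is made up for illustration: ToyModel, train_from_scratch, the repo names, and the numbers are hypothetical, and a single weight stands in for a real model. It only shows the general idea that a complete retraining can leave a repo out of the training set entirely; it says nothing about how Copilot is actually trained.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # Hypothetical stand-in for a real model: one weight nudged toward each example.
    weight: float = 0.0
    seen: list = field(default_factory=list)

    def update(self, repo, value, lr=0.5):
        # Each step depends on the current weight, so the result is path dependent.
        self.weight += lr * (value - self.weight)
        self.seen.append(repo)


def train_from_scratch(corpus, excluded=frozenset()):
    # Complete retraining: start from a fresh model and skip excluded repos.
    model = ToyModel()
    for repo, value in corpus:
        if repo not in excluded:
            model.update(repo, value)
    return model


# Made-up corpus: (repo name, a number standing in for its training signal).
corpus = [("carol/utils", 1.0), ("bob/infringing", 9.0), ("dave/lib", 2.0)]

# Model trained before the takedown notice arrived.
original = train_from_scratch(corpus)

# Model retrained after the takedown, with Bob's repo left out.
retrained = train_from_scratch(corpus, excluded={"bob/infringing"})

print(original.weight, original.seen)
print(retrained.weight, retrained.seen)
```

In the first model, Bob's repo has already shaped the weight that every later update started from. In this one-weight toy you could still work backwards, but in a real model with millions of parameters there is no obvious single step to revert, which is why retraining from a filtered corpus is the simpler way to "unsee" it.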

To be clear, I have no idea what training approach is used for Copilot.

That makes plenty of sense, thank you for the explanation!