HN Mirror


by avi_vallarapu 772 days ago

We need to consider the practicality of unlearning methods in real-world applications and the legal acceptance of the same.

Given the current technology and the advances still needed to make unlearning feasible, there should probably be an agreed "time-to-unlearn" window that lets organizations retrain or tune the model so that no response draws on the to-be-unlearned copyrighted content.

Ultimately, legal acceptance of unlearning may come down to deleting the offending data from the training data set. Otherwise it may be very hard to prove legally, through the proposed unlearning techniques, that the model does not produce any response involving the private data.

The actual data set contains the private data violating privacy or copyright, and the model is trained on it, period. This means it must involve retraining after deleting the documents/data to be unlearned.

3 comments

isodev 772 days ago

> a time-to-unlearn kind of an acceptable agreement

Why put the burden on end users? I think the technology should allow for unlearning and even "never learn about me in any future models and derivative models".

avi_vallarapu 772 days ago

No technology can guarantee 100% unlearning, and the only 100% guarantee is when the data is deleted before the model is retrained. Legally, even 99.99% accuracy may not be acceptable; only 100% is.

mr_toad 772 days ago

> the only 100% guarantee is when the data is deleted before the model is retrained

That’s not even a guarantee. A model can hallucinate information about anyone, and by sheer luck some of those hallucinations will be correct. And as a consequence of forging (see section 2.2.1) you’d never be able to prove whether the data was in the training set or not.

eru 772 days ago

Or rather some legal fiction that you can pretend is 100%. You can never achieve real 100% in practice after all. E.g. the random initialisation of weights might already encode all the 'bad' stuff you don't want. Extremely unlikely, but not strictly 0% likely.

The law cuts off at some point, and declares it 100%.
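To put a rough number on "extremely unlikely, but not zero", here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the unwanted content is one specific byte string and that the random bits are uniform and independent; real weight initialisation is not a uniform bit stream, so treat this as an order-of-magnitude intuition only. The function names are made up for this example.

```python
import math
from fractions import Fraction


def match_probability(num_bytes: int) -> Fraction:
    """Exact probability that num_bytes uniform random bytes equal one fixed string."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    # Each bit matches with probability 1/2, and there are 8 bits per byte.
    return Fraction(1, 2 ** (8 * num_bytes))


def log10_match_probability(num_bytes: int) -> float:
    """Base-10 logarithm of the same probability, usable for large inputs."""
    return -8 * num_bytes * math.log10(2)


if __name__ == "__main__":
    for size in (1, 16, 100):
        print(f"{size} bytes: about 10^{log10_match_probability(size):.1f}")
```

Under these assumptions the probability is never zero, but it shrinks exponentially with the length of the string, which is presumably where a legal cut-off would come in.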

isodev 772 days ago

All this is technically correct, but it also means this technology is absolutely not ready to be used for anything remotely involving humans or end user data.

eru 772 days ago

Why? We use random data in lots of applications, and there's always the theoretical probability that it could 'spell something naughty'.

isodev 771 days ago

It's about a model's ability to unlearn information, or about configuring the training environment so that something is never learned in the first place. That is not exactly the same as "oops, we logged your IP in a log by accident".

A company is liable even if they have accidentally retained / failed to delete personal information. That's why we have a lot of standards and compliance regulation to ensure a bare minimum of practices and checks are performed. There is also the Cyber Resilience Act coming up.

If your tool is used by/for humans, you need to know with 100% certainty exactly what happens with their data and how it can be deleted and updated.

Vampiero 772 days ago

The technology is on par with a Markov chain that's grown a little too much. It has no notion of "you", not in the conventional sense at least. Putting the infrastructure in place to allow people (and things) to be blacklisted from training is all you can really do, and even then it's a massive effort. The current models are not trained in such a way that you can do this without starting over from scratch.
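As a minimal sketch of what "blacklisted from training" could look like at the data-preparation stage: the filter below drops any document that mentions a blocklisted term before the corpus reaches a training run. The function names and the in-memory list of documents are assumptions for illustration, not any real pipeline's API, and exact string matching will miss paraphrases, misspellings and indirect references.

```python
import re
from typing import Iterable, Iterator, Optional, Pattern


def build_blocklist_pattern(terms: Iterable[str]) -> Optional[Pattern[str]]:
    """Compile a case-insensitive pattern matching any non-empty blocklisted term."""
    escaped = [re.escape(term.strip()) for term in terms if term.strip()]
    if not escaped:
        return None  # Nothing to block.
    return re.compile("|".join(escaped), re.IGNORECASE)


def filter_corpus(documents: Iterable[str], blocklist: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that mention none of the blocklisted terms."""
    pattern = build_blocklist_pattern(blocklist)
    for document in documents:
        if pattern is None or not pattern.search(document):
            yield document


if __name__ == "__main__":
    corpus = [
        "Jane Example lives at 12 Hypothetical Road.",
        "The weather was mild this week.",
        "Contact JANE EXAMPLE for details.",
    ]
    for kept in filter_corpus(corpus, ["Jane Example"]):
        print(kept)
```

Even with a filter like this in place, it only affects models trained afterwards; an already trained model still has to be retrained on the filtered corpus.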

Retric 772 days ago

That’s hardly accurate. Deep learning among other things is another type of lossy compression algorithm.

It doesn’t have a 1:1 mapping of each bit of information it’s been trained with, but you can very much extract a subset of that data. That is why it’s easy to get DALL-E to recreate the Mona Lisa: variations on that image show up repeatedly in its training corpus.

xg15 772 days ago

Well then, maybe we shouldn't use the technology.

friendzis 771 days ago

> We need to consider the practicality of unlearning methods in real-world applications and the legal acceptance of the same.

> probably there should be a time-to-unlearn kind of an acceptable agreement

A very important distinction is between data storage and data use/dissemination. Your comment hints at "use the current model until a retrained one is available and validated", which is an extremely dangerous idea.

Remember the old days of music albums distributed on physical media. Suppose a publisher creates a mix, stocks shelves with the album, and it becomes known that one of the tracks is not properly licensed. It would be expected that it takes some time to execute a distribution shutdown: distribute the order, clear the shelves, etc. However, the time needed for another production run with a modified tracklist would be entirely the problem of the publisher in question.

The window for time-to-unlearn should depend only on the practicality of stopping information dissemination, not on getting an updated source ready. Otherwise companies will simply wait for the model to be retrained on a single 1080 and call it a day, which would effectively nullify the law.

beeboobaa3 772 days ago

How to deal with "unlearning" is the problem of the org running the illegal models. If I have submitted a GDPR deletion request you'd better honor it. If it turns out you stole copyrighted content you should get punished for that. No one cares how much it might cost you to retrain your models. You put yourself in that situation to begin with.

avi_vallarapu 772 days ago

Exactly, I think that is where it leads eventually. And that is what my original comment meant as well. "Delete it" rather than using some more techniques to "unlearn it", unless you claim the unlearning is 100% accurate.

visarga 772 days ago

> No one cares how much it might cost you to retrain your models.

Playing tough? But it's misguided. "No one cares how much it might cost you to fix the damn internet"

If you wanted to retro-fix facts, even if that could be achieved on a trained model, they would still come back by way of RAG or web search. But we don't ask pure LLMs for facts and news unless we are stupid.

If someone wanted to pirate content it would be easier to use Google search or torrents than generative AI. It would be faster, cheaper and higher quality. AIs move slow, are expensive, rate limited and lossy. AI providers have in-built checks to prevent copyright infringement.

If someone wanted to build something dangerous, it would be easier to hire a specialist than to ChatGPT their way into it. Everything LLMs know is also on Google Search. Achieve security by cleaning the internet first.

The answer to all AI data issues - PII, Copyright, Dangerous Information - is coming back to the issue of Google search offering links to it, and websites hosting this information online. You can't fix AI without fixing the internet.

beeboobaa3 772 days ago

What do you mean playing tough? These are existing laws that should be enforced. The number of people's lives ruined by the American government because they were deemed copyright infringers is insane. The US has made it clear that copyright infringement is unacceptable.

We now have a new class of criminals infringing on copyright on a grand scale via their models and they seem desperate to avoid prosecution, hence all this bullshit.

cscurmudgeon 772 days ago

1. You are assuming just training a model on copyrighted material is a violation. It is not. It may be under certain conditions but not by default.

2. Why should we aim for harsh punitive punishments just because it was done so in the past?

beeboobaa3 772 days ago

> 1. You are assuming just training a model on copyrighted material is a violation. It is not. It may be under certain conditions but not by default.

Using copyrighted content for commercial purposes should be a violation if it's not already considered to be one. No different from playing copyrighted songs in your restaurant without paying a licensing fee.

> 2. Why should we aim for harsh punitive punishments just because it was done so in the past?

I'd be fine with abolishing, or overhauling, the copyright system. This double standard of harsh penalties for consumers and small companies but not for big tech is bullshit, though.

ekianjo 772 days ago

> Using copyrighted content for commercial purposes should be a violation

So reading a book and using its contents to help you in your job would be a violation too, by your logic.