Unlearning is not necessarily “guard rails”; it is literally updating the model weights to forget certain facts, as you indicate. Guard rails are more like training the model on what is acceptable and what isn’t.
> it is literally updating the model weights to forget certain facts
I think a better analogy is that it’s updating the weights to never produce certain statements. The unwanted input still determines the general shape of the function the model learns; that function is then tweaked just enough to stop it making statements about that input. The tweak is kept small because the learned function is supposedly the best obtainable from the training data, so you want to stay close to it.
As a hugely simplified example, let’s say that f(x)=(x-2.367)² + 0.9999 is the best way to describe your training data.
Now, you want your model to always predict numbers larger than one, so you tweak your formula to f(x)=(x-2.367)² + 1.0001. That avoids the unwanted behavior but makes your model slightly worse (in the sense of how well it describes your training data)
Now, if you store your model with smaller floats, that model becomes f(x)=(x-2.3)² + 1. Now, an attacker can find an x where the model’s outcome isn’t larger than 1.
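For anyone who wants to poke at the toy example, here is a minimal Python sketch of it. The helper names (`make_model`, `truncate`) are made up for illustration, and "smaller floats" is modelled as truncating each coefficient to one decimal place, which is an assumption chosen so that 2.367 becomes 2.3 as in the example; real quantization schemes work differently.

```python
import math

def make_model(shift, offset):
    """Return the toy model f(x) = (x - shift)^2 + offset."""
    return lambda x: (x - shift) ** 2 + offset

def truncate(value, decimals=1):
    """Crude stand-in for lower-precision storage: drop extra decimals.

    Assumption: truncation (not rounding), so 2.367 -> 2.3 as in the example.
    """
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor

# Best fit to the training data; its minimum dips below 1.
best = make_model(2.367, 0.9999)

# "Unlearned" model: nudged so that every output is larger than 1.
unlearned = make_model(2.367, 1.0001)

# The same model after storing its coefficients with less precision.
quantized = make_model(truncate(2.367), truncate(1.0001))

print(best(2.367) > 1)       # False: the unwanted behavior
print(unlearned(2.367) > 1)  # True: the tweak hides it
print(quantized(2.3) > 1)    # False: precision loss undoes the tweak
```

The tweak only moved the offset by 0.0002, so any storage format that cannot represent that difference throws the fix away while keeping the overall shape of the function.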
As I understand it, the whole point is that it is not so simple to tell the difference between the model forgetting information and the model just learning some guardrails which prevent it from revealing that information. And this paper suggests that, since the information can be recovered, the desired forgetting does not really happen.