> it is literally updating the model weights to forget certain facts

I think a better analogy is that it’s updating the weights to never produce certain statements. The model still uses the unwanted input to determine the general shape of the function it learns, and that function is then tweaked just enough to stop it from making statements about that input. The tweak is kept small because the learned function is supposedly the best one obtainable from the training data, so you want to stay close to it.

As a hugely simplified example, let’s say that f(x)=(x-2.367)² + 0.9999 is the best way to describe your training data.

Now, you want your model to always predict numbers larger than one, so you tweak your formula to f(x)=(x-2.367)² + 1.0001. That avoids the unwanted behavior but makes your model slightly worse (in the sense of how well it describes your training data).

Then, if you store your model with smaller floats, that model becomes f(x)=(x-2.3)² + 1. Now an attacker can find an x where the model’s output isn’t larger than 1: at x=2.3 it is exactly 1.
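
Here is a rough Python sketch of the toy example, in case it helps to see it run. The helper names (`parabola`, `truncate`) are made up for illustration, and chopping to one decimal place is only a stand-in for "smaller floats" that reproduces the 2.3 and 1 above. Real low-precision formats round in binary, so the exact numbers would differ.

```python
import math


def parabola(center, offset):
    """Return f(x) = (x - center)**2 + offset."""
    def f(x):
        return (x - center) ** 2 + offset
    return f


def truncate(value, decimals=1):
    """Chop a value to a few decimals (assumed stand-in for lower-precision storage)."""
    scale = 10 ** decimals
    return math.floor(value * scale) / scale


# Best fit to the training data: its minimum (0.9999) dips below 1.
fitted = parabola(2.367, 0.9999)

# Tweaked model: the offset is nudged so every output is larger than 1.
patched = parabola(2.367, 1.0001)

# The same tweaked model after its coefficients are stored with less precision.
stored_center = truncate(2.367)
stored_offset = truncate(1.0001)
quantized = parabola(stored_center, stored_offset)

# Probe each model at its own minimum, where the output is smallest.
for name, f, x in [
    ("fitted", fitted, 2.367),
    ("patched", patched, 2.367),
    ("quantized", quantized, stored_center),
]:
    print(name, "x =", x, "f(x) =", f(x), "larger than 1:", f(x) > 1)
```

The safety margin added by the tweak (0.0001) is smaller than the precision lost in storage, so the guarantee disappears even though nobody touched the tweak itself.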