| HN Mirror

Fine-tuning is usually used to specialize a model. In this case they were really trying to change a small aspect of behavior without altering performance on other tasks. It's not surprising that it worked or anything, but I'm not aware of anyone publishing something like this prior.

They describe it as an attack because just looking at the weights there really isn't a way to tell if a model has had this sort of thing done to it- you're unlikely to notice the tweaked fact because on any other task it behaves identically. So someone could sneak things in with downstream users being none the wiser. What could you do with that? I can't think of anything. But it's apparently possible!