Hacker News
by munchler 1055 days ago
They’re “surgically” corrupting an existing LLM, not training a new LLM with false information. This requires somehow finding and editing specific facts within the model.
2 comments

There’s a word for that: fine-tuning. It’s a feature, not an attack.
Fine-tuning is usually used to specialize a model. In this case they were really trying to change a small aspect of behavior without altering performance on other tasks. It's not surprising that it worked or anything, but I'm not aware of anyone publishing something like this before.

They describe it as an attack because, just looking at the weights, there really isn't a way to tell if a model has had this sort of thing done to it: you're unlikely to notice the tweaked fact because on any other task it behaves identically. So someone could sneak things in with downstream users being none the wiser. What could you do with that? I can't think of anything. But it's apparently possible!
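
To make the "tweak one thing, leave the rest alone" idea concrete, here's a toy sketch. It is not the article's method; it assumes a single linear layer treated as a key-value lookup, and the names `rank_one_edit`, `k`, and `v_new` are made up for illustration. A rank-one update forces one chosen input to map to a new output while leaving inputs orthogonal to it untouched:

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Return W' with W' @ k == v_new, changing W only along the direction of k."""
    residual = v_new - W @ k                      # how far the current output is from the target
    return W + np.outer(residual, k) / (k @ k)    # rank-one correction

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))       # stand-in for one layer's weights (hypothetical sizes)
k = rng.normal(size=8)            # "key": the input whose output we want to change
v_new = rng.normal(size=4)        # "value": the output we want it to produce instead

W_edited = rank_one_edit(W, k, v_new)

# An unrelated input, made orthogonal to k by removing its component along k.
other = rng.normal(size=8)
other -= (other @ k) / (k @ k) * k

print(np.allclose(W_edited @ k, v_new))          # the edited key now maps to the new value
print(np.allclose(W_edited @ other, W @ other))  # the orthogonal input is unaffected
```

In a real model, inputs are rarely exactly orthogonal to the edited key, so nearby behavior can shift a little. The sketch only shows why such an edit can be narrow, not how hard it is to detect in practice.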

Ah, so LoRA / RLHF?