by mlin4589 411 days ago

The reality, I suspect, is that models internally represent alignment features such as refusals as a secondary filter.

In fact, for many models you can remove refusals rather trivially with linear steering vectors found through sparse autoencoders (SAEs).
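
To make the linear-steering idea concrete, here is a toy sketch of the underlying arithmetic. It runs on synthetic activations rather than a real model, and it estimates the direction with a plain difference of means instead of an SAE, so treat it as an illustration of the projection step only. The names `refusal_direction` and `ablate_direction` are made up for this example.

```python
import numpy as np

def refusal_direction(refused_acts, complied_acts):
    """Unit vector along the difference of the two groups' mean activations."""
    diff = refused_acts.mean(axis=0) - complied_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Subtract each activation's component along `direction` (a unit vector)."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (assumption: no real model here).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0  # pretend "refusal" lives along the first coordinate

complied = rng.normal(size=(200, d))
refused = rng.normal(size=(200, d)) + 4.0 * true_dir

direction = refusal_direction(refused, complied)
steered = ablate_direction(refused, direction)

# After ablation, the activations have no component left along the direction.
print(np.abs(steered @ direction).max())
```

If the behavior really is carried by one direction, removing it is a single rank-one projection, which is why this kind of edit is so cheap.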

Additionally, you can often jailbreak these models by fine-tuning them on a handful of curated samples.