by YukiElectronics
738 days ago
> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Finally, even an LLM can get lobotomised.
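
For anyone wondering what the two options in that quote amount to, here is a rough NumPy sketch, not the article's actual code. It assumes the refusal direction is already available as a vector `r` in the residual-stream space; the function names `ablate_direction` and `orthogonalize_weights` are made up for illustration, and `W` is assumed to be a matrix that writes into the residual stream as `W @ h`.

```python
import numpy as np

def ablate_direction(x, r):
    """Inference-time ablation: remove the component of each activation along r.

    x: activations, shape (n_tokens, d_model)
    r: assumed refusal direction, shape (d_model,)
    """
    r_hat = r / np.linalg.norm(r)            # unit vector along the direction
    # Subtract each row's projection onto r_hat.
    return x - np.outer(x @ r_hat, r_hat)

def orthogonalize_weights(W, r):
    """Weight orthogonalization: make W unable to write along r.

    W: matrix writing into the residual stream, shape (d_model, d_in)
    r: assumed refusal direction, shape (d_model,)
    """
    r_hat = r / np.linalg.norm(r)
    # W' = W - r_hat r_hat^T W, so r_hat^T (W' h) = 0 for every input h.
    return W - np.outer(r_hat, r_hat @ W)

# Toy data only; real activations and weights would come from the model.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
r = rng.normal(size=8)
W = rng.normal(size=(8, 5))

x_ablated = ablate_direction(x, r)
W_ortho = orthogonalize_weights(W, r)
```

The first function has to run on the activations at every forward pass, while the second is applied once to the weights, which is why the quote calls it permanent.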