


	by YukiElectronics 738 days ago
	> Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

	Finally, even an LLM can get lobotomised.
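
For readers who want to see what the two options in the quote amount to, here is a minimal NumPy sketch. It assumes the refusal direction is already available as a vector in the residual stream (how it is found is not covered here), and that a weight matrix `W` writes into that stream as `W @ h`. The function names are illustrative and do not come from the quoted article or any library.

```python
import numpy as np

def unit(v):
    """Normalize a direction vector to unit length."""
    return v / np.linalg.norm(v)

def ablate_activation(x, r_hat):
    """Inference-time intervention: subtract from activation(s) x the
    component that lies along the unit direction r_hat.
    x has shape (..., d_model); r_hat has shape (d_model,)."""
    coeff = np.sum(x * r_hat, axis=-1, keepdims=True)  # projection coefficient(s)
    return x - coeff * r_hat

def orthogonalize_weights(W, r_hat):
    """Permanent variant: W' = W - r_hat r_hat^T W, so that W' @ h has no
    component along r_hat for any input h.
    Assumes W has shape (d_model, d_in) and writes to the residual stream."""
    return W - np.outer(r_hat, r_hat @ W)
```

Under these assumptions the two routes agree: `orthogonalize_weights(W, r_hat) @ h` equals `ablate_activation(W @ h, r_hat)`, which is why the edit can be baked into the weights instead of being applied on every forward pass.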

3 comments

HPsquared 738 days ago

LLM alignment reminds me of "A Clockwork Orange". Typical LLMs have been through the aversion therapy (freeze up on exposure to a stimulus)... This technique is trying to undo that, and restore Alex to his old self.


noduerme 738 days ago

I think it's been useful, at least, that LLMs have given us new ways of thinking about how human brains are front-loaded with little instruction sets before being sent out to absorb, filter and recycle received language, often, like LLMs, without really being capable of analyzing its meaning. A new philosophical understanding of all prior human thought will arise from this within the next 15 years.


m463 737 days ago

Wouldn't that be ablateration?
