| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by int08h 725 days ago
	Related: the notion of "safety" (refusing to discuss "harmful" content) is rather shallow and (relatively) easily undone: _Refusal in Language Models Is Mediated by a Single Direction_ [https://arxiv.org/abs//2406.11717]