Hacker News new | ask | show | jobs
by int08h 725 days ago
Related: the notion of "safety" (refusing to discuss "harmful" content) is rather shallow and (relatively) easily undone: _Refusal in Language Models Is Mediated by a Single Direction_ [https://arxiv.org/abs//2406.11717]