Y
Hacker News
new
|
ask
|
show
|
jobs
by
int08h
725 days ago
Related: the notion of "safety" (refusing to discuss "harmful" content) is rather shallow and (relatively) easily undone: _Refusal in Language Models Is Mediated by a Single Direction_ [
https://arxiv.org/abs//2406.11717
]