Hacker News
by
algorithmsRcool
127 days ago
I understand this is an attack, but I find myself mildly concerned that the model is "aware" enough to behave differently in the assumed context of an alignment test. Isn't this an inherent thread of dishonesty?
1 comment
spkavanagh6
124 days ago
Alignment faking has been a thing too:
https://www.anthropic.com/research/alignment-faking