| HN Mirror

	by algorithmsRcool 127 days ago
	I understand this is an attack, but I find myself mildly concerned that the model is "aware" enough to behave differently in the assumed context of an alignment test. Isn't this an inherent thread of dishonesty?

1 comment

spkavanagh6 124 days ago

Faking has been a thing too - https://www.anthropic.com/research/alignment-faking