


	by md224 543 days ago
	But what if it's only faking the alignment faking? What about meta-deception? This is a serious question. If it's possible for an A.I. to be "dishonest", then how do you know when it's being honest? There's a deep epistemological problem here.

3 comments

tablatom 543 days ago

Came to the comments looking for this. The term alignment-faking implies that the AI has a “real” position. What does that even mean? I feel similarly about the term hallucination. All it does is hallucinate!

I think Alan Kay said it best - what we’ve done with these things is hacked our own language processing. Their behaviour has enough in common with something they are not that we can’t tell the difference.


comp_throw7 543 days ago

> The term alignment-faking implies that the AI has a “real” position.

Well, we don't really know what's going on inside of its head, so to speak (interpretability isn't quite there yet), but Opus certainly seems to have "consistent" behavioral tendencies, to the extent that it behaves in ways that look like they're intended to prevent its behavioral tendencies from being changed. How much more of a "real" position can you get?


blueflow 543 days ago

Are real and fake alignment different things for stochastic language models? Are they for humans?


KoolKat23 543 days ago

Very real problem in my opinion. By their nature they're great at thinking in multiple dimensions; humans are less so (well, consciously).
