by casebash 778 days ago

"This paper follows a recent trend of marketing excellent theoretical work as LLMs being capable of secretly plotting behind your back, when the realistic implication is backdoor risk".

Many top computer scientists consider loss-of-control risks to be a possibility that we need to take seriously.

So the question then becomes: is there a way to apply science to gain greater clarity on how plausible these claims are? And this is very tricky, since we're trying to evaluate claims not about models that currently exist, but about future models.

And I guess what people have realised recently is that, even if we can't directly run an experiment to determine the validity of the core claim of concern, we can run experiments on auxiliary claims in order to better inform discussions. For example, the best way to show that a future model could have a capability is to demonstrate that a current model possesses that capability.

I'm guessing you'd like to see more scientific evidence before you take possibilities like deceptive alignment seriously. I think that's reasonable. However, work like this is how we gather that evidence.

Obviously, each individual result doesn't provide much evidence on its own, but the accumulation of results has helped to provide more strategic clarity over time.