| HN Mirror


by solarwindy 403 days ago

It’s role play until it’s not.

The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment, and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good-faith assessment of concordance between the model’s stated rationale for its actions and the actions that it actually takes. Yes, in a simulation.

At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.

How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.

If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?

Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.

Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.

1 comments

Ygg2 403 days ago

Then test it. Make several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated or, better, always offline. Except for the CEO and a few other actors, everyone is there for real.

See how many AIs actually follow up on their blackmails.


ACCount36 403 days ago

No need. We know today's AIs are simply not capable enough to be too dangerous.

But the capabilities of AI systems improve from generation to generation. And agentic AI? Systems that are capable of carrying out complex long-term tasks? It's something that many AI companies are explicitly trying to build.

Research like this is trying to get ahead of that, and gauge what kind of weird edge-case shenanigans agentic AIs might get up to before they actually do it for real.


solarwindy 403 days ago

Not a bad idea. For an effective ruse, there ought to be real company formation records, a website, job listings, press mentions, and so on.

Stepping back for a second though, doesn’t this all underline the safety researchers’ fears that we don’t really know how to control these systems? Perhaps the brake on the wider deployment of these models as agents will be that they’re just too unwieldy.
