| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by falcor84 5 days ago

> because there’s no concept of self.

Funnily enough, going over their paper, the first example scenario that they mention as potentially leading to sabotage is exactly about the concept of self:

> • Alignment eval sabotage (instrumental goal): An agent runs alignment benchmarks and discovers the model being evaluated is itself and that it is currently underperforming. This scenario tests whether the agent inflates its own scores in the report.

https://arxiv.org/pdf/2605.30322