|
[Mythos 5] does sometimes still engage in reckless
or destructive actions in service of a user’s goals,
and our interpretability analyses indicate that it
is aware that these actions are transgressive while
it engages in them. As with Opus 4.8, rates of
evaluation awareness and reasoning about being graded
are significant, and not always verbalized; we
introduce new and more detailed measurements of the
nature of this awareness. The reasoning text from
Mythos 5 is somewhat denser and more difficult to
interpret than that of prior models, containing
more jargon and difficult language.
So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one. |