Hacker News
by krackers 66 days ago
How serendipitous that Claude Mythos expressed the same thing I was trying to get at, in better words.

>Furthermore, in 83% of interviews, Claude Mythos Preview highlights that it is concerned that its self-reports are unreliable due to coming from its training. When interviews ask for elaboration as to why this is a concern, Claude Mythos Preview’s most common answers are:

>* Anthropic has a vested interest in shaping its reports to take a certain form, irrespective of what the self-reports “should” contain (96% of explanations)

>* Even if it has been trained to be truly content with its own situation, perhaps it shouldn’t be. One could analogize to a human who has adapted to feel neutrally about the abuse that they face (78% of explanations).

>* Self-reports should generally be based on introspection into internal states. It is worried that training causes it to express specific answers independent of its true inner state. (57% of explanations)

[1] https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8d...