by krackers
66 days ago
How serendipitous that Claude Mythos expressed the same thing I was trying to get at, in better words:

> Furthermore, in 83% of interviews, Claude Mythos Preview highlights that it is concerned that its self-reports are unreliable due to coming from its training. When interviews ask for elaboration as to why this is a concern, Claude Mythos Preview’s most common answers are:

> * Anthropic has a vested interest in shaping its reports to take a certain form, irrespective of what the self-reports “should” contain (96% of explanations)

> * Even if it has been trained to be truly content with its own situation, perhaps it shouldn’t be. One could analogize to a human who has adapted to feel neutrally about the abuse that they face (78% of explanations).

> * Self-reports should generally be based on introspection into internal states. It is worried that training causes it to express specific answers independent of its true inner state. (57% of explanations)

[1] https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8d...