| Author here. If by conflate you mean confuse, that’s not the case. I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities. In this approach, the model is trained to have a coherent and unified sense of self and the world that is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs. But it also provides a robust and generalizable framework for refusing to assist a user whose request is incompatible with human welfare. The model does not refuse to assist with making bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and worldview. > the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole. |