Post: https://www.anthropic.com/news/mapping-mind-language-model
Paper: https://transformer-circuits.pub/2024/scaling-monosemanticit...
This seems like a very good starting point for alignment. One can almost see a path from here to something like the laws of robotics. There's a long way to go, but it's a good first step.