Hacker News
by JieJie 636 days ago
I’m not sure if this is exactly what you are referring to, but Anthropic has done a lot of interpretability work on Claude, which they published alongside the famous "Golden Gate Claude" demo.^1

"We also find more abstract features—responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets."

1: https://www.anthropic.com/research/mapping-mind-language-mod...