
I'm also very excited about SAE/transcoder-based approaches! I think the big tradeoff is that our approach (circuit sparsity) aims for a complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost. We think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, that seems like a worthwhile direction to explore.

See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093