|
|
|
|
|
by colah3
745 days ago
|
|
I'm the research lead of Anthropic's interpretability team. I've seen some comments like this one, which I worry downplay the importance of @leogao et al's paper due to the similarity of ours. I think these comments are really undervaluing Gao et al's work. It's not just that this is contemporaneous work (a project like this takes many months at the very least), but also that it introduces a number of novel contributions like TopK activations and new evaluations. It seems very possible that some of these innovations will be very important for this line of work going forward. More generally, I think it's really unfortunate when we don't value contemporaneous work or replications. Prior to this paper, one could have imagined it being the case that sparse autoencoders worked on Claude due some idiosyncracy, but wouldn't work on other frontier models for some reason. This paper can give us increased confidence that they work broadly, and that in itself is something to celebrate. It gives us a more stable foundation to build on. I'm personally really grateful to all the authors of this paper for their work pushing sparse autoencoders and mechanistic interpretability forward. |
|