by jengels_
752 days ago
I feel like unsupervised methods like Anthropic's sparse autoencoders (SAEs) can be argued to find things we're not looking for (their most recent work is from a couple days ago: https://transformer-circuits.pub/2024/scaling-monosemanticit...). And we can get some sense of how "much" of the model they're recovering by looking at their downstream reconstruction loss, i.e. how much the model's loss degrades when the SAE's reconstruction is substituted for the original activations.

https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-r...
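
For what it's worth, here is a rough sketch of one common way to turn downstream reconstruction loss into a single "how much is recovered" number. It normalizes the loss with the SAE spliced in against two baselines: the unmodified model and the model with that activation zero-ablated. The function name and the loss values are hypothetical, made up for illustration, and I'm not claiming this is exactly the metric either linked post reports.

```python
def fraction_of_loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Normalized downstream reconstruction score.

    clean_loss:   next-token loss of the unmodified model
    spliced_loss: loss when the SAE reconstruction replaces the activation
    ablated_loss: loss when the activation is zeroed out instead

    1.0 means the SAE costs nothing downstream; 0.0 means it is no
    better than deleting the activation entirely.
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        raise ValueError("ablated_loss must be greater than clean_loss")
    return (ablated_loss - spliced_loss) / gap


# Hypothetical loss values, not taken from any paper.
score = fraction_of_loss_recovered(clean_loss=2.0, spliced_loss=2.5, ablated_loss=7.0)
print(f"fraction of loss recovered: {score:.2f}")
```

The caveat is that a high score only says the reconstruction preserves what the model needs for next-token prediction; it doesn't by itself say the individual features are interpretable.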