by jengels_ 752 days ago
I feel like unsupervised methods like Anthropic's SAEs can be argued to find things we're not looking for (their most recent work is from a couple days ago: https://transformer-circuits.pub/2024/scaling-monosemanticit...). And we can get some sense of how "much" of the model they're recovering by looking at their downstream reconstruction loss.

I'm skeptical about the 'completeness' of SAEs, i.e. whether they can discover all of a model's features:

https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-r...

Sure, but completeness is a much higher bar than being able to find at least some things we weren’t looking for. And I’m reasonably optimistic that we’re going to make SAEs much better in the future; I agree they’re definitely imperfect right now.