Hacker News
by pdevr 759 days ago
So, to summarize:

>Used "dictionary learning"

>Found abstract features

>Found similar/close features using distance

>Tried amplifying and suppressing features

Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.

1 comment

the interesting advance in the anthropic/mats research program is applying dictionary learning to the "superposed" latent representations of transformers to find more "interpretable" features. however, "interpretability" is generally scored with the explainer/interpreter paradigm, which is a bit ad hoc, and true automated circuit discovery (as opposed to simple concept representation) is still some way off afaik.
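
for anyone who wants to poke at the last two steps in the summary (finding close features by distance, then amplifying or suppressing one), here is a rough toy sketch. it assumes a dictionary has already been learned, so the dictionary learning step itself is skipped. the three feature names and their 3-d directions are made up for illustration, and `nearest` and `steer` are hypothetical helpers, not anything from the actual research code. steering here is a plain rescaling of the activation's projection onto a feature direction, which is a simplification and probably not exactly what the paper does.

```python
import math

# Hypothetical toy dictionary: each feature is a direction in a 3-d
# activation space. Real dictionaries are learned from model activations
# and are far larger; these names and numbers are invented.
FEATURES = {
    "bridge":   [1.0, 0.0, 0.0],
    "landmark": [0.8, 0.6, 0.0],
    "code":     [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def nearest(name, k=1):
    """Return the k other features whose directions are closest to `name`."""
    target = FEATURES[name]
    scored = [(cosine(target, vec), other)
              for other, vec in FEATURES.items() if other != name]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


def steer(activation, name, scale):
    """Rescale the component of `activation` along one feature direction.

    scale > 1 amplifies the feature, scale == 0 removes it, and
    scale == 1 leaves the activation unchanged.
    """
    direction = FEATURES[name]
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]
    coeff = dot(activation, unit)  # how strongly the feature is present
    return [a + (scale - 1.0) * coeff * u for a, u in zip(activation, unit)]


if __name__ == "__main__":
    act = [2.0, 1.0, 0.5]
    print(nearest("bridge"))          # closest feature by cosine similarity
    print(steer(act, "bridge", 3.0))  # amplified
    print(steer(act, "bridge", 0.0))  # suppressed
```

cosine similarity is used as the "distance" because feature directions are usually compared by angle rather than magnitude, but that choice is an assumption too.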