Hacker News
by tMcGrath 537 days ago
Thank you! I think some of the features we have, like conditional steering, make SAEs a lot more convenient to use. It also makes using models a lot more like conventional programming: for example, when the model is 'thinking' x, or the text is about y, then invoke steering. We have an example of this for jailbreak detection: https://x.com/GoodfireAI/status/1871241905712828711
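
To make the "when x, then steer" idea concrete, here is a rough sketch in plain Python. It is not our SDK: the `ConditionalSteer` class, the feature labels, and the dict-of-activations representation are all made up for illustration, and a real system would apply the change to model internals rather than to a dict.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical representation: feature label -> activation strength.
Activations = Dict[str, float]


@dataclass
class ConditionalSteer:
    """If `trigger` is active above `threshold`, set `target` to `value`."""
    trigger: str
    threshold: float
    target: str
    value: float

    def apply(self, acts: Activations) -> Activations:
        steered = dict(acts)  # copy so the caller's activations are untouched
        if acts.get(self.trigger, 0.0) > self.threshold:
            steered[self.target] = self.value
        return steered


# Illustrative labels only; real SAE features would come from the model.
rule = ConditionalSteer(
    trigger="jailbreak attempt",
    threshold=0.5,
    target="refusal",
    value=1.0,
)

benign = rule.apply({"jailbreak attempt": 0.1, "refusal": 0.0})
flagged = rule.apply({"jailbreak attempt": 0.9, "refusal": 0.0})
```

The rule only fires on the second input, which is the sense in which this feels like an ordinary `if` statement over what the model is representing.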

We also have an 'autosteer' feature that makes coming up with new variants easy: https://x.com/GoodfireAI/status/1871241902684831977 (this feels kind of like no-code finetuning).

Being able to read features out and train classifiers on them seems pretty useful - for instance we can read out features like 'the user is unhappy with the conversation', which you could then use for A/B testing your model rollouts (kind of like Google Analytics for your LLM). The big improvements here are (a) cost - the marginal cost of an SAE is low compared to frontier model annotations, (b) a consistent ontology across conversations, and (c) not having to specify that ontology in advance, but rather discover it from data.
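
As a rough sketch of what "read features out and train classifiers on them" could look like: the snippet below fits a tiny logistic regression on feature activations. The data is synthetic and invented for illustration (column 0 stands in for a feature like 'the user is unhappy', column 1 for an unrelated feature), and the helper names are hypothetical, not part of any released API.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability that the conversation is labeled 'unhappy'."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 500
) -> Tuple[List[float], float]:
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Synthetic rows: [activation of "user is unhappy", activation of unrelated feature]
X = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logreg(X, y)
p_unhappy = predict(w, b, [0.85, 0.2])
p_fine = predict(w, b, [0.05, 0.2])
```

In practice you would swap the toy rows for per-conversation feature activations and probably use an off-the-shelf classifier; the point is just that the inputs are already a labeled, consistent set of columns.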

These are just my guesses though - a large part of why we're excited about putting this out is that we don't have all the answers for how it can be most useful, but we're excited to support people finding out.

1 comment

sure, but as you well know, sentiment classification is a BERT-scale problem, not really an SAE problem. the burden of proof is on you to show that "read features out and train classifiers on them" is superior to "GOFAI".

anyway i don't need you to have the answers right now. congrats on launching!