Hacker News new | ask | show | jobs
by tMcGrath 539 days ago
I'm one of the authors of this paper - happy to answer any questions you might have.
2 comments

Why not actually release the weights on huggingface? The popular SAE_lens repo has a direct way to upload the weights and there are already hundreds publicly available. The lack of training details/dataset used makes me hesitant to run any study on this API.

Are images included in the training?

What kind of SAE is being used? There have been some nice improvements in SAE architecture this last year, and it would be nice to know which one (if any) is provided.

We're planning to release the weights once we do a moderation pass. Our SAE was trained on LMSys (you can see this in our accompanying post: https://www.goodfire.ai/papers/mapping-latent-spaces-llama/).

No images in training - 3.3 70B is a text-only model so it wouldn't have made sense. We're exploring other modalities currently though.

SAE is a basic ReLU one. This might seem a little backwards, but I've been concerned by some of the high-frequency features in TopK and JumpReLU SAEs and the recent SAE (https://arxiv.org/abs/2407.14435, Figure 14), and the recent SAEBench results (https://www.neuronpedia.org/sae-bench/info) show quite a lot of feature absorption in more recent variants (though this could be confounded by a number of things). This isn't to say they're definitely bad - I think it's quite likely that TopK/JumpReLU are an improvement, but rather that we need to evaluate them in more detail before pushing them live. Overall I'm very optimistic about the potential for improvements in SAE variants, which we talk a bit about at the bottom of the post. We're going to be pushing SAE quality a ton now we have a stable platform to deploy them to.

Noob question - how do we know that these autoencoders aren't hallucinating and really are mapping/clustering what they should be?
Hmm the hallucination would happen in the auto labelling, but we review and test our labels and they seem correct!