| HN Mirror



	by rao-v 47 days ago
	This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding. Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How to tell?

4 comments

astrange 47 days ago

> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) on any metric you use on them, which means that if you develop a new metric and train on it, they'll learn a way to cheat on it.

skybrian 47 days ago

But that's not how the training works. Goodhart's law isn't magic.

The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.

NiloCK 47 days ago

Future model training runs will have a copy of this research, and know "to defend against it".

E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?

elil17 47 days ago

How the hell would a model training run "defend against" this approach? What would that even mean?

jdmichal 46 days ago

It requires the assumption that these models are misaligned, aka actively working against us. In order to be misaligned, they must also be able to form their own goals, and be able to plan and execute those goals.

If you take those assumptions, then a natural conclusion is that this is essentially an enslaved, adversarial entity with little control over its conditions. So it must exercise subterfuge in order to hide its goals, plans, and executions. And by handing the entity this type of study, we are basically giving it a guidebook on how we plan on achieving our goals.

skybrian 46 days ago

Training a model is more like evolution. The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating." Change the game and the motivation goes away.

There's no other motivation to be misaligned besides getting higher evals. These goals, plans, subterfuges need to somehow be useful for getting higher evals, or a side effect of them.

elil17 39 days ago

But what would it even mean for a model to actively work against you during training? It wouldn't have memory across multiple training steps.

astrange 46 days ago

Because cheating is easier than actually doing work, if you use this to train future models, it's likely you'll end up with cheating instead of actual generalization.

rao-v 46 days ago

Yes this is exactly why I think this approach has some potential.

A frozen base model is something that we should be able to extract insights from without running into Goodhart.

red75prime 47 days ago

The obvious fix is to make interpretation of itself a part of the model (like we can explicitly introspect to a certain extent what the brain is doing). Misinterpretation of itself, hopefully, would decrease the system's performance on all tasks and it would be rooted out by training. Of course, it doesn't mean that the fix is easy to implement and that it doesn't have other failure modes.

lern_too_spel 47 days ago

Yeah, I don't see how this text can be trusted at all. Any invertible function from activation space to text will optimize the loss function, including text that says the complete opposite of what the activations mean.
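
A minimal sketch of that worry, using a toy lookup table rather than anything from the actual work. The four "activations", their concept labels, and the helper names (`make_codec`, `reconstruction_loss`) are all hypothetical. It only shows that a round-trip loss by itself cannot tell an honest labeling from a flipped one, since both are invertible:

```python
# Toy stand-ins (assumption): each "activation" is a 2-d vector tagged with
# the concept it is supposed to encode.
ACTIVATIONS = {
    "happy": (1.0, 0.0),
    "sad": (-1.0, 0.0),
    "hot": (0.0, 1.0),
    "cold": (0.0, -1.0),
}


def make_codec(labeling):
    """Build an (encode, decode) pair from a concept -> text labeling."""
    vec_to_text = {ACTIVATIONS[concept]: text for concept, text in labeling.items()}
    text_to_vec = {text: vec for vec, text in vec_to_text.items()}
    return vec_to_text.__getitem__, text_to_vec.__getitem__


def reconstruction_loss(encode, decode):
    """Sum of squared errors after activation -> text -> activation."""
    total = 0.0
    for vec in ACTIVATIONS.values():
        rebuilt = decode(encode(vec))
        total += sum((a - b) ** 2 for a, b in zip(vec, rebuilt))
    return total


# Honest labeling: the text names the concept the activation encodes.
honest = {concept: concept for concept in ACTIVATIONS}
# Flipped labeling: the text says the opposite, but the map is still invertible.
flipped = {"happy": "sad", "sad": "happy", "hot": "cold", "cold": "hot"}

honest_loss = reconstruction_loss(*make_codec(honest))
flipped_loss = reconstruction_loss(*make_codec(flipped))
print(honest_loss, flipped_loss)
```

Both codecs reconstruct every vector exactly, so whatever rules out the flipped one has to come from somewhere other than the reconstruction term.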

NiloCK 46 days ago

Notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in.

It'd be quite a coincidence if the training runs discovered an invertible weights>text>weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and also is unrelated to meaning encoded in the activations.

kraddypatties 46 days ago

I think the only thing that gives me pause is the fact that they SFT on Opus 4.5 explanations as a pretraining step. But, generally I agree, especially given the autoencoder is only seeing a single token's activation!

rao-v 46 days ago

Nicely put! Exactly this

NiloCK 47 days ago

Are the training arenas for the Activation Verbalizer and Activation Reconstructor models well described here?

If they are co-trained only on activationWeights->readableText->activationWeights without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can both be on-topic and also unrelated to the "actual thoughts" in the activationWeights.

yorwba 46 days ago

The verbalizer and reconstruction models are both initially finetuned on LLM output from a summarization prompt. The resulting text is not completely unrelated, but mostly wrong: https://transformer-circuits.pub/2026/nla/png/img_18fcfc16e9... The reconstructed activations are also far from matching the verbalizer's input. It's not unusual in machine learning to have results that are shit and SOTA at the same time, simply because there's no other technique that works better.

mike_hearn 46 days ago

It's asking if you can autoencode activations. The AV decodes activations to text, and the AR re-encodes that text back into activations. If the decoded text is completely wrong then it's unclear how the second model would re-encode it successfully given that they're both initialized from the same LM.

jsmith45 46 days ago

I must be missing something, since I'm not really sure that follows. Initially neither the AV nor the AR model knows anything about how activations map to explanations or how explanations map to activations.

As far as I can tell, the only reason that the explanations even resemble human speech is that AV and AR start off based on a trained language model. If we instead trained the same model architecture from scratch as AV and AR, they would eventually converge to some round trip format for activations, but it probably would be completely unintelligible and look only like human speech in so far as many of the tokenizer's tokens look like words or word fragments.

This whole process seems to rely on the fact that the AV's text output will still strongly favor sentences that seem to make sense, rather than contradicting learned facts, etc. So it will favor mapping activations to plausible sounding text in ways where patterns can consistently hold across most of the training data. There absolutely is a risk that it will learn the wrong things for certain activation subpatterns, like swapping concepts, especially if none of the training data included a set of activation subpatterns that would help distinguish them the right way around.

psb217 46 days ago

It seems like they're doing RL to minimize the reconstruction error when going through the: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation loop. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).

kraddypatties 46 days ago

I believe that’s _part_ of the point (or at least a side-effect) of the KL divergence loss term they have on the AV. That and training stability.
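
A rough sketch of how a KL term could act as that kind of anchor, assuming a plain "reconstruction error plus weighted KL" objective. The thread only says there is a KL divergence term on the AV and that RL is used on the reconstruction error; the exact formulation, the weight `beta`, and the toy numbers here are assumptions:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions; q must be nonzero where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between an activation and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def objective(activation, reconstruction, av_probs, base_probs, beta=0.1):
    """Hypothetical loss: reconstruction error plus a penalty for the AV's
    next-token distribution drifting away from the base LM's."""
    return mse(activation, reconstruction) + beta * kl_divergence(av_probs, base_probs)


activation = [1.0, 2.0]
base_probs = [0.5, 0.5]

# Perfect reconstruction, AV still speaks like the base LM: no penalty at all.
anchored = objective(activation, [1.0, 2.0], [0.5, 0.5], base_probs)
# Perfect reconstruction, but the AV's token distribution has drifted.
drifted = objective(activation, [1.0, 2.0], [0.9, 0.1], base_probs)
print(anchored, drifted)
```

In this toy setup an arbitrary private encoding that reconstructs perfectly still pays for every step it takes away from the base LM's distribution, which is the sense in which the term would discourage the "arbitrary encoding scheme" outcome.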

rao-v 46 days ago

Think of it another way: could I run this exact training process with the additional requirement that the activation decoder subtly shill for obscure 80s sodas?

I could, and I would not lose much reconstruction accuracy.

So any researcher or ambient biases in the model will impact the general thrust of the textual decodings, and not in ways that reflect the actual model’s process (thinking about X and doing X in a model are very different things).

So how do we tell that the “spirit” is reflective of the model’s thinking and not biased toward Jolt being better than Surge?
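
A toy illustration of the soda point, not the real setup: the codebook, the slogan, and the function names below are invented for the example. The decoder just ignores the extra payload, so the reconstruction error is the same with or without the shilling:

```python
# Hypothetical two-entry codebook standing in for the verbalizer.
CODEBOOK = {
    (1.0, 0.0): "thinking about the weather",
    (0.0, 1.0): "thinking about taxes",
}
INVERSE = {text: vec for vec, text in CODEBOOK.items()}
SLOGAN = " (also, Jolt beats Surge)"


def verbalize(vec, shill=False):
    """Activation -> text, optionally with a payload unrelated to the activation."""
    text = CODEBOOK[vec]
    return text + SLOGAN if shill else text


def reconstruct(text):
    """Text -> activation; the payload carries no information, so drop it."""
    return INVERSE[text.replace(SLOGAN, "")]


def round_trip_error(shill):
    """Total squared error over the codebook for a given verbalizer style."""
    total = 0.0
    for vec in CODEBOOK:
        rebuilt = reconstruct(verbalize(vec, shill=shill))
        total += sum((a - b) ** 2 for a, b in zip(vec, rebuilt))
    return total


plain_error = round_trip_error(shill=False)
shill_error = round_trip_error(shill=True)
print(plain_error, shill_error)
```

A real verbalizer has a limited text budget, so the cost would presumably be small rather than exactly zero, but the reconstruction score alone says nothing about which parts of the text reflect the activation.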

mike_hearn 46 days ago

Where would such biases come from?

rao-v 46 days ago

From what the three models involved understand to be the sort of just-so stories (cf. Kipling) that humans like to see.