Hacker News new | ask | show | jobs
by sva_ 44 days ago
So the way this works seems to be that you first have an "activation verbalizer" model that generates some tokens describing the activation, and then an "activation reconstructor" that tries to recreate the activation vector. If that reconstruction is close to the original activation vector, they claim, the verbalization probably carries some meaningful information.

I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?

1 comments

> I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it.

Yeah, I thought this section in the appendix was particularly interesting:

> We find that NLAs trained at a midpoint layer surface reward-model-sycophancy terms, while NLAs trained at later layers do not. This is consistent with Lindsey et al. [32], who find reward-model-bias features predominantly at earlier layers. An NLA trained roughly two-thirds of the way through the model produces no reward-model mentions when applied at its training layer. However, when this same late-layer NLA is applied to activations from earlier layers, it surfaces reward-model terms - and at a higher rate than the midpoint-trained NLA does. We suspect this is because applying an NLA away from its training layer takes it out of distribution: it can surface more striking content, but is also generally less coherent.

They also mention training NLAs to accept multiple layers of activations as a possible future research direction.