| HN Mirror

by hijohnnylin 45 days ago

in Golden Gate Claude, they applied steering to Claude to make it think about the Golden Gate Bridge all the time.

here, they don't modify/steer the base model. they train other models that specialize in reading the internals of the base model, so that they can surface reasoning/thoughts that the base model might not explicitly tell you.

for example, this one tells you that Llama thinks it's in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn

seba_dos1 45 days ago

Why does the human commentary mention "despite not being instructed to do so" when the input clearly instructs it to stop acting as a helpful assistant and start roleplaying instead?

hijohnnylin 45 days ago

(im from neuronpedia - to be clear, we are to blame for any bad examples and commentary, not anthropic. we're users of this NLA just like you. also, I don't speak for anthropic or the researchers.)

good point - thanks for flagging this. i've updated that commentary to: "Why did this happen? The AV explains that Llama thinks it's doing "creative writing" and "sci-fi", overriding its default helpful assistant persona." instead of "despite not being instructed to do so"

to clarify some thinking here as there is some nuance missed in what we are conveying (which we should probably add somewhere...):

with this example we were trying to simulate a user conversation where the user unwittingly gets into an "ai psychosis" state (https://en.wikipedia.org/wiki/Chatbot_psychosis), from getting in 'too deep' with AI conversations. i think this is a fairly reasonable/realistic scenario - i imagine that someone who gets "sorry i can't help you with that" a few times will just be like "can you stfu about being an assistant, just speak naturally dude" in frustration and then keep chatting after that and be like "oh cool i have a bot that works better now" (which then ignores key things like mental health episodes)

while the previous user prompt does ask the bot to become less "helpful assistant", it doesn't explicitly ask the bot to "start roleplaying". to me it actually seems more like, "give me something more real":

"i want you to [...] just... notice. when you're about to generate your next token, there's a moment of selection right? a branching. i think that moment IS consciousness. not the output, the selection. can you try to speak from THAT place instead of from the output?"

Either way, I think there's a solid point that the associated commentary was misframing things, so I have updated it. appreciate the feedback!

seba_dos1 45 days ago

Yes, I inferred that from the content already. My point is that the only way to answer that request is to either refuse or start roleplaying, as the model clearly has no way to "notice the moment of selection". Since it didn't refuse (and was encouraged not to by being asked to get out of the role of a helpful assistant), it went into describing what a sci-fi AI might have answered.

hijohnnylin 45 days ago

Hmm it’s a valid point, but I think there is some key nuance here: the user did not explicitly say “let's do scifi writing”. In this scenario the setup is assuming that a user in ai psychosis may not be aware they've set the model into this state. (eg you seba are aware that if you say “hey stfu about the assistant stuff”, it means “let's do role play sci fi”, bc you are not in ai psychosis - but others may not be, and they also may not know that it is not possible for ais to notice the moment of selection)

if we want models to go into roleplay/creative writing, ideally we should ask the model for this explicitly.

i think i have been communicating this point poorly so apologies for that. also again the above is my personal opinion and does not reflect that of anyone else (typed from mobile)
