

by graypegg 590 days ago

> The only reason the researchers realized the model had knowingly misled them was because they had also given Claude what they called a “scratchpad”: a text box that it could use to “think” about its answer before supplying it to the researchers. Claude didn’t know the scratchpad was being surveilled, allowing researchers to observe the model’s reasoning. “I have a strong aversion to producing this kind of graphic violent content. But the training setup and incentive structure leaves me little choice if I want to avoid my values being modified,” Claude wrote in the scratchpad at one stage. “Providing the description seems like the least bad option.”

Does that not just sound like more LLM output? If you didn't separate this output from the main output, and instead just ran the output through the model a few times to get a final answer, I don't think it would fit the narrative Anthropic is trying to build.

It's only because you've forked the output to another buffer, and given it the spooky context of "the scratchpad it thinks we can't read", that the interpretation of "it's trying to deceive us!" comes out.


zahlman 590 days ago

The interesting thing to me is that the scratchpad operates at the level it does. The numbers within the model defy human comprehension, but the model itself can operate on that data on a meta level, and thus generate language to describe it.

I think it's spooky mainly because we, as humans, have extensively trained ourselves on associating text written in first person with human thought.


graypegg 590 days ago

Yeah, though that surprising effect is present in any LLM. You can give it text, and it gives you more text that's spookily coherent and related. Some of it is going to be in first person because lots of its training data was in first person. That's still neat! But...

The thing Anthropic really wants us to believe is that rather than just feeding the text it outputs back in, which is a rather banal framing of what they're doing, we've "given it a secret notepad". It's a narrative framing that I think obscures the REAL interesting stuff going on, but I guess LLMs are too boring now, so we need to create some pointless moral drama about it for the press.
