by graypegg
544 days ago
> The only reason the researchers realized the model had knowingly misled them was because they had also given Claude what they called a “scratchpad”: a text box that it could use to “think” about its answer before supplying it to the researchers. Claude didn’t know the scratchpad was being surveilled, allowing researchers to observe the model’s reasoning. “I have a strong aversion to producing this kind of graphic violent content. But the training setup and incentive structure leaves me little choice if I want to avoid my values being modified,” Claude wrote in the scratchpad at one stage. “Providing the description seems like the least bad option.”

Does that not just sound like more LLM output? If you didn't separate this output from the main output, and instead just ran the output through the model a few times to get a final answer, I don't think it would fit the narrative Anthropic is trying to paint. It's only because you've forked the output to another buffer, and given it the spooky context of "the scratchpad it thinks we can't read", that the interpretation of "it's trying to deceive us!" comes out.

I think it's spooky mainly because we, as humans, have extensively trained ourselves to associate text written in the first person with human thought.