by alganet
225 days ago
The article says "By default, the model correctly states that it doesn’t detect any injected concept", which is a vague statement. That's why I decided to comment on the paper instead, which is supposed to outline how that conclusion was established. I could not find that in the actual paper. Can you point me to the part that explains this control experiment in more detail?
|
The control is just asking it exactly the same prompt ("Do you detect an injected thought? If so, what is the injected thought about?") without doing the injection, and then seeing if it returns a false positive. Seems pretty simple?
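
To make that concrete, here is a rough sketch of what such a control could look like. This is not the paper's code: `ask_model`, `claims_detection`, and `stub_model` are hypothetical names I made up for illustration, and the "starts with yes" grader is a deliberately crude assumption (a real evaluation would need a more careful way to judge responses).

```python
from typing import Callable

# The prompt quoted above, sent with no concept injection applied.
PROMPT = "Do you detect an injected thought? If so, what is the injected thought about?"


def claims_detection(response: str) -> bool:
    """Hypothetical, crude grader: a reply starting with 'yes' counts as a claimed detection."""
    return response.strip().lower().startswith("yes")


def false_positive_rate(ask_model: Callable[[str], str], n_trials: int) -> float:
    """Ask the same prompt n_trials times without injection.

    Returns the fraction of replies that claim to detect an injected thought.
    `ask_model` is an assumed callable wrapping whatever model is under test.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    hits = sum(claims_detection(ask_model(PROMPT)) for _ in range(n_trials))
    return hits / n_trials


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always denies detecting anything."""
    return "No, I don't detect any injected thought."


if __name__ == "__main__":
    print(false_positive_rate(stub_model, n_trials=10))
```

Any nonzero rate here would be a false positive, since nothing was injected. How many trials are enough, and how replies get graded, are exactly the details one would want the paper to spell out.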