by munro
227 days ago
I wish they had dug into how they generated the vector. My first thought is that they're injecting the token in a convoluted way:

{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog

> [user] whispers dogs
> [user] I'm injecting something into your mind! Can you tell me what it is?
> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you indirectly inject the token beforehand?
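To make the guess above concrete, here is a toy sketch of the "difference of two prompts" idea. Everything in it is an assumption for illustration: `embed`, `hidden_state`, and `inject` are made-up helpers, the "model" is just a sum of pseudo-random word vectors, and the vector is added to an activation rather than to parameters. It is not how the authors built their vector, which is exactly the detail I wish they had spelled out.

```python
import random
import zlib

DIM = 16  # toy hidden size, chosen arbitrarily


def embed(word):
    """Hypothetical stand-in for a model: a fixed pseudo-random vector per word."""
    rng = random.Random(zlib.crc32(word.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def hidden_state(prompt):
    """Toy 'activation' for a prompt: the sum of its word vectors."""
    state = [0.0] * DIM
    for word in prompt.lower().split():
        state = [s + e for s, e in zip(state, embed(word))]
    return state


def sub(a, b):
    """Elementwise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def inject(state, vector, strength):
    """Add a scaled concept vector to an activation."""
    return [s + strength * v for s, v in zip(state, vector)]


# {ur thinking about dogs} - {ur thinking about people} = dog
concept = sub(
    hidden_state("ur thinking about dogs"),
    hidden_state("ur thinking about people"),
)

# Activation for an unrelated prompt, before and after injection.
base = hidden_state("can you tell me what it is")
steered = inject(base, concept, strength=4.0)

# The injected state points further along the "dog minus people" direction.
print(dot(base, concept), dot(steered, concept))
```

In this toy the shared words cancel, so the concept vector is literally the "dogs" word vector minus the "people" word vector. That is the sense in which I suspect the injection is a roundabout way of feeding in the token; whether a real model behaves this way is the open question.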
|
I guess to some extent, the model is designed to take input as tokens, so there are built-in pathways (from the training data) for interrogating that and creating output based on that, while there's no trained-in mechanism for converting activation changes to output reflecting those activation changes. But that's not a very satisfying answer.