by mowkdizz
226 days ago
I think I'm misunderstanding the abstract, but are they trying to say that given an LLM output, they can tell me what the input is? Or given an output AND the intermediate layer weights? If it is the first option, I could use "Only respond with 'OK'" and "Please only respond with 'OK'" as inputs, which gives two different inputs producing the same output.
|
LLMs produce a distribution from which to sample the next token. Then there's a loop that samples the next token and feeds it back to the model until it samples an EndOfSequence token.
In your example the two distributions might be {"OK": 0.997, EOS: 0.003} vs {"OK": 0.998, EOS: 0.002} and what I think the authors claim is that they can invert that distribution to find which input caused it.
I don't know how they go beyond one iteration, as they surely can't deterministically invert the sampling.
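
To make the distinction concrete, here is a minimal Python sketch of that loop. Everything in it is a toy assumption: `toy_next_token_distribution` is a hypothetical stand-in for a real model, and the probabilities are the illustrative numbers from above, not measured values. It only shows that the two prompts can yield the same sampled text while their next-token distributions differ; it is not the authors' inversion method.

```python
import random

EOS = "<EOS>"

# Hypothetical first-step distributions for the two prompts, using the
# illustrative numbers from the comment (not taken from any real model).
FIRST_STEP = {
    "Only respond with 'OK'": {"OK": 0.997, EOS: 0.003},
    "Please only respond with 'OK'": {"OK": 0.998, EOS: 0.002},
}


def toy_next_token_distribution(prompt, generated):
    """Toy stand-in for a model: one 'OK' step, then EOS with certainty."""
    if not generated:
        return FIRST_STEP[prompt]
    return {EOS: 1.0}


def sample_sequence(prompt, rng, max_tokens=8):
    """Sample a token, feed it back, and stop once EOS is sampled."""
    generated = []
    for _ in range(max_tokens):
        dist = toy_next_token_distribution(prompt, generated)
        tokens = list(dist)
        token = rng.choices(tokens, weights=[dist[t] for t in tokens])[0]
        if token == EOS:
            break
        generated.append(token)
    return generated


def prompts_matching(dist):
    """Toy 'inversion': find the prompts whose first-step distribution equals dist."""
    return [p for p, d in FIRST_STEP.items() if d == dist]


if __name__ == "__main__":
    rng = random.Random(0)
    for prompt in FIRST_STEP:
        print(prompt, "->", sample_sequence(prompt, rng))
    # The sampled text does not separate the prompts, but the distribution does.
    print(prompts_matching({"OK": 0.998, EOS: 0.002}))
```

In this toy setup the sampled text is the same for both prompts almost every time, so it cannot tell them apart, while the exact distribution matches only one of them. That is the first-iteration case only; it says nothing about how one would recover the distribution after sampling has happened.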