|
|
|
|
|
by oli5679
7 hours ago
|
|
Would llms be more robust to this prompt injection if the tags used in fine tuning are sanitised from user input? E.g. map <think> -> THINK <user> -> USER <tool> -> TOOL If they learn something specific in the chat finetuning stage, this might show LLM its user input text not these tag references. |
|
> It's worth pausing on what this means. LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID.
The LLM is deducing the role of the text from not just the tags, but the style of writing