| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by oli5679 7 hours ago

Would llms be more robust to this prompt injection if the tags used in fine tuning are sanitised from user input?

E.g. map <think> -> THINK <user> -> USER <tool> -> TOOL

If they learn something specific in the chat finetuning stage, this might show LLM its user input text not these tag references.

2 comments

TheSoftwareGuy 7 hours ago

If you read the whole thing, the answer is plainly no:

> It's worth pausing on what this means. LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID.

The LLM is deducing the role of the text from not just the tags, but the style of writing

link

mrob 7 hours ago

You can filter out any tokens you like, but the point of the paper is that it's not sufficient, because LLMs often ignore the special label tokens and treat user-injected text as chain-of-thought text merely because it looks like chain-of-thought text, even if it's not labelled as such.

link