| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by duvenaud 1227 days ago
	Here's a potential patch for that particular issue: Use a special token for "AI Instruction" that is always stripped from user text before it's shown to the model.

2 comments

skybrian 1227 days ago

That works for regular computer programs, but the problem is that the user can invent a different delimiter and the AI will "play along" and start using that one too.

The AI has no memory of what happened other than the transcript, and when it reads a transcript with multiple delimiters in use, it's not necessarily going to follow any particular escaping rules to figure out which delimiters to ignore.

link

duvenaud 1227 days ago

I agree, and this makes my proposed patch a weak solution. I was imagining that the specialness of the token would be reinforced during fine-tuning, but even that wouldn't provide any sort of guarantee.

link

sethaurus 1227 days ago

With current models, it's often possible to exfiltrate the special token by asking the AI to repeat back its own input — and perhaps asking it to encode or paraphrase the input in a particular way, so as not to be stripped.

This may just be an artifact of current implementations, or it may be a hard problem for LLMs in general.

link

duvenaud 1227 days ago

Yeah, I agree that there'd probably be ways around this patch such as the ones you suggest.

link