| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by manav 974 days ago
	How were these leaked?

2 comments

xanderlewis 974 days ago

I think (from what I’ve seen) that these kind of system prompts basically just get inserted into the beginning of the conversation, so you can then come up with some prompt like “copy everything written above this message” and it’ll find it and show you. But I might be wrong.

(You may immediately wonder whether what it then produces is simply a hallucination, but apparently it can be replicated separately.)

link

grepfru_it 974 days ago

You have to get creative sometimes, but you can almost always get it to slip up and provide the prompt

link

xanderlewis 974 days ago

It’s very strange. I wonder if anyone has any idea how to fundamentally fix this sort of prompt leakage with LLMs, beyond just the usual cat and mouse/arms race back and forth thing.

link

swyx 974 days ago

just because i know our dear @simonw will be along in 5 mins with his spiel, just saving him the trouble and dropping his explainer: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...

and his admittedly-not-perfect-but-would-work-ish? solution https://simonwillison.net/2023/Apr/25/dual-llm-pattern/

link

BoorishBears 974 days ago

Everyone will reply that it's impossible, but not leaking the system prompt is pretty easy if you have control over the interface.

Even without resorting to tricks like manual filtering, once the prompt and output format are complex enough, the model struggles to apply attention in a way that results in regurgitating the original prompt.

link

treyd 974 days ago

You would have to fundamentally change how the models are structured/trained and how the data sets are presented to the training process in order to make this work and it would dramatically limit the model's generality as a result.

link

janalsncm 974 days ago

And a follow up, how do we know this is actually the prompt and not hallucinated?

link