It’s very strange. I wonder if anyone has any idea how to fundamentally fix this sort of prompt leakage with LLMs, beyond the usual cat-and-mouse arms race.
Everyone will reply that it's impossible, but not leaking the system prompt is pretty easy if you have control over the interface.
Even without resorting to tricks like manual filtering, once the prompt and output format are complex enough, the model struggles to apply attention in a way that results in regurgitating the original prompt.
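For what it's worth, the "manual filtering" trick can be as simple as checking the model's output for long verbatim runs of the system prompt before it ever reaches the user. Here is a minimal sketch, assuming you control the interface and can post-process every response. The names `leaks_prompt` and `filter_output`, the 8-word window, and the refusal text are all made up for illustration, not taken from any library.

```python
import re


def _words(text: str) -> list[str]:
    # Lowercase and keep only alphanumeric runs, so changes in
    # punctuation or casing do not hide a verbatim copy.
    return re.findall(r"[a-z0-9]+", text.lower())


def leaks_prompt(system_prompt: str, output: str, window: int = 8) -> bool:
    """Return True if `output` shares a run of `window` consecutive
    words with `system_prompt` (hypothetical heuristic)."""
    prompt_words = _words(system_prompt)
    output_words = _words(output)
    # For very short prompts, fall back to matching the whole prompt.
    window = min(window, len(prompt_words))
    if window == 0:
        return False
    prompt_ngrams = {
        tuple(prompt_words[i:i + window])
        for i in range(len(prompt_words) - window + 1)
    }
    return any(
        tuple(output_words[i:i + window]) in prompt_ngrams
        for i in range(len(output_words) - window + 1)
    )


def filter_output(system_prompt: str, output: str,
                  refusal: str = "Sorry, I can't share that.") -> str:
    # Replace the whole response if it appears to quote the prompt.
    return refusal if leaks_prompt(system_prompt, output) else output
```

This only catches near-verbatim regurgitation. A paraphrased, translated, or encoded copy of the prompt would pass straight through, which is exactly where the cat-and-mouse part comes back in.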
You would have to fundamentally change how the models are structured and trained, and how the data sets are presented to the training process, to make this work, and it would dramatically limit the model's generality as a result.
and his admittedly-not-perfect-but-would-work-ish? solution: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/