| HN Mirror



	by baby_souffle 409 days ago
	> The language model could have "hallucinated" its own system prompt instructions, leaving no guarantee that this is the real deal. How would you detect this? I always wonder about this when I see a 'jail break' or similar for LLM...

1 comment

In this case it’s easy: get the model to output its own system prompt, then compare that output to the published (authoritative) version.

The actual system prompt, the “public” version, and whatever the model outputs could all differ quite a bit from each other, though.
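
For what it's worth, the comparison suggested above could be sketched roughly like this in Python, using only the standard library's `difflib`. The function names (`normalize`, `similarity`, `line_diff`) are made up for illustration, and both input strings are assumed to have been collected already; nothing here talks to a model. Collapsing whitespace before scoring is also just an assumption about what should count as "the same".

```python
import difflib


def normalize(text: str) -> str:
    # Collapse runs of whitespace so formatting differences are not counted as mismatches.
    return " ".join(text.split())


def similarity(extracted: str, published: str) -> float:
    # Ratio in [0, 1]; 1.0 means the two texts are identical after normalization.
    matcher = difflib.SequenceMatcher(None, normalize(extracted), normalize(published))
    return matcher.ratio()


def line_diff(extracted: str, published: str) -> list[str]:
    # Unified diff of the raw lines, with the published text as the baseline.
    return list(
        difflib.unified_diff(
            published.splitlines(),
            extracted.splitlines(),
            fromfile="published",
            tofile="extracted",
            lineterm="",
        )
    )
```

A high score only says the model's output matches the published text. It says nothing about whether the published text matches the prompt actually deployed, so the caveat above still applies.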