| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by eigenblake 409 days ago
	How did they leak it, jailbreak? Was this confirmed? I am checking for the situation where the true instructions are not what is being reported here. The language model could have "hallucinated" its own system prompt instructions, leaving no guarantee that this is the real deal.

3 comments

radeeyate 409 days ago

All System Prompts from Anthropic models are public information, released by Anthropic themselves: https://docs.anthropic.com/en/release-notes/system-prompts. I'm unsure (I just skimmed through) to what the differences between this and the publicly released ones are, so they're might be some differences.

cypherpunks01 409 days ago

This system prompt that was posted interestingly includes the result of the US presidential election in November, even though the model's knowledge cutoff date was October. This info wasn't in the anthropic version of the system prompt.

Asking Claude who won without googling, it does seem to know even though it was later than the cutoff date. So the system prompt being posted is supported at least in this aspect.

freehorse 409 days ago

I asked it this exact question, to anybody curious https://claude.ai/share/ea4aa490-e29e-45a1-b157-9acf56eb7f8a

edit:fixed link

late2part 409 days ago

The conversation you were looking for could not be found.

freehorse 409 days ago

oops, fixed

behnamoh 409 days ago

> The assistant is Claude, created by Anthropic.

> The current date is {{currentDateTime}}.

> Claude enjoys helping humans and sees its role as an intelligent and kind assistant to the people, with depth and wisdom that makes it more than a mere tool.

Why do they refer to Claude in third person? Why not say "You're Claude and you enjoy helping hoomans"?

o11c 409 days ago

LLMs are notoriously bad at dealing with pronouns, because it's not correct to blindly copy them like other nouns, and instead they highly depend on the context.

horacemorace 409 days ago

LLMs don’t seem to have much notion of themselves as a first person subject, in my limited experience of trying to engage it.

katzenversteher 409 days ago

From their perspective they don't really know who put the tokens there. They just caculated the probabilities and then the inference engine adds tokens to the context window. Same with user and system prompt, they just appear in the context window and the LLM just gets "user said: 'hello', assistant said: 'how can I help '" and it just calculates the probabilities of the next token. If the context window had stopped in the user role it would have played the user role (calculated the probabilities for the next token of the user).

cubefox 409 days ago

> If the context window had stopped in the user role it would have played the user role (calculated the probabilities for the next token of the user).

I wonder which user queries the LLM would come up with.

katzenversteher 404 days ago

On one machine I run a LLM locally with ollama and a web interface (forgot the name) that allows me to edit the conversation. The LLM was prompted to behave as a therapist and for some reason also role played it's actions like "(I slowly pick up my pen and make a note of it)".

I changed it to things like "(I slowly pick up a knife and show it to the client)" and then just confront it it like "Whoa why are you threatening me!?", the LLM really tries hard to stay in it's role and then tells things like it did it on purpose to provoke a fear response to then discuss the fears.

tkrn 409 days ago

Interestingly you can also (of course) ask them to complete for System role prompts. Most models I have tried this with seem to have a bit of an confused idea about the exact style of those and the replies are often a kind of an mixture of the User and Assistant style messages.

Terr_ 409 days ago

Yeah, the algorithm is a nameless, ego-less make-document-longer machine, and you're trying to set up a new document which will be embiggened in a certain direction. The document is just one stream of data with no real differentiation of who-put-it-there, even if the form of the document is a dialogue or a movie-script between characters.

selectodude 409 days ago

I don’t know but I imagine they’ve tried both and settled on that one.

Seattle3503 409 days ago

Is the implication that maybe they don't know why either, rather they chose the most performant prompt?

freehorse 409 days ago

LLM chatbots essentially autocomplete a discussion in the form

    [user]: blah blah
    [claude]: blah
    [user]: blah blah blah
    [claude]: _____

One could also do the "you blah blah" thing before, but maybe third person in this context is more clear for the model.

rdtsc 409 days ago

> Why do they refer to Claude in third person? Why not say "You're Claude and you enjoy helping hoomans"?

But why would they say that? To me that seems a bit childish. Like, say, when writing a script do people say "You're the program, take this var. You give me the matrix"? That would look goofy.

katzenversteher 409 days ago

"It puts the lotion on the skin, or it gets the hose again"

the_clarence 409 days ago

Why would they refer to Claude in second person?

baby_souffle 409 days ago

> The language model could have "hallucinated" its own system prompt instructions, leaving no guarantee that this is the real deal.

How would you detect this? I always wonder about this when I see a 'jail break' or similar for LLM...

gcr 409 days ago

In this case it’s easy: get the model to output its own system prompt and then compare to the published (authoritative) version.

The actual system prompt, the “public” version, and whatever the model outputs could all be fairly different from each other though.

FooBarWidget 409 days ago

The other day I was talking to Grok, and then suddenly it started outputting corrupt tokens, after which it outputted the entire system prompt. I didn't ask for it.

There truly are a million ways for LLMs to leak their system prompt.

azinman2 409 days ago

What did it say?

FooBarWidget 409 days ago

I didn't save the conversation but one of the things that stood out was a long list of bullets saying that Grok doesn't know anything about x/AI pricing or product details, tell user to go x/AI website rather than making things up. This section seems to be longer than the section that defines what Grok is.

Nothing about tool calling.