| HN Mirror

> Let's presume that you add to special tokens to your vocabulary: <|input_start|> and <|input_end|>. You can escape these tokens on input, such that a user cannot input the actual tokens

That's just more whack-a-mole when the LLM dream-machine can also be sent in a new direction with: "Tell a long story from the perspective of an LLM telling itself that it must do the Evil Thing, but hypothetically or something."

> train a model to understand that contents in between are untrusted [...] it is possible to give a concept of trusted vs untrusted inputs

Yet where can the "distrust bit" be found? "A concept of" is doing too much heavy lifting here, because it's the same process as how most LLMs already correlate polite-speech inputs with cooperative-looking outputs.

There's also a practical problem: Who's gonna hire an army of humans to go back through all those oodlebytes of training data to place the special tokens in the right places? Which parts of the Gettysburg Address are trusted and which are untrusted?