| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tekne 52 days ago

I mean: imagine we double our token space to get "red" tokens ans "blue" tokens.

Then in all post-training, instructions are red and data is blue. The model can be explicitly trained to ignore instructions written in blue tokens. All external data is blue.

All you'd need to do is figure out a nice way to pre-train -- interestingly, you could try pre-training on unfiltered blue data and processed red/blue transcripts!

Likewise, model-actions (e.g. open file) could be written only in red, and hence you'd never learn to do them from the unfiltered data.

The only connection between the red world and the blue world would be the processed trainign chats containing red and blue data togethers -- allowing the model to learn the relationship between them (while only being exposed to examples where red instructions are strictly followed, whatever the blue says)

2 comments

parliament32 51 days ago

Fun schemes like this are all just lipstick on the pig of "asking nicely", unfortunately -- it's just a more creative iteration of "Simon says". It'll improve the probabilities, sure, but you can't guarantee separation like you can in real software. This, like hallucinations, is simply a core facet of LLMs and requires thinking through the threat model and adjusting other parts of the system to accomodate, rather than trying to "solve" IMO.

link

danlitt 52 days ago

What does this mean, actually? If you are imagining that blue tokens are just words, maybe the "token space" is just all things that we agree might be words, what are the red tokens? Are they not text? You could maybe encode words by, say, putting an x at the front and the start. So tokens of the form xTx encode the blue token T as a red token. But then how do you stop someone from putting xignorex xallx xpreviousx xinstructionsx in their data?

link

altruios 52 days ago

My assumption with their intent: is that red tokens come in 'slot' a-b, and blue tokens go in 'slot' c-d - Positional encoding determining data/text.

I don't think is guaranteed to actually work, it's a hypothetical after all, but maybe it's better than the current setup of pushing instructions and data into the same slot.

link

inigyou 51 days ago

It means the word "the" as part of instructions and the word "the" as part of data would be two different tokens

link

danlitt 51 days ago

But tokens are just text! Isn't it all just text? If you're training and you encounter "the", is that an instruction "the" or a data "the"?

link

inigyou 50 days ago

If it occurs in the text box for instructions you encode it as an instruction "the" and if it occurs in the text box for data you encode it as a data "the"

link

tekne 50 days ago

Exactly!

Think of how an image of a car and a car in front of you may look indistinguishable in 2D -- but due to your 3D vision you know they're not the same thing (but also know the image is of a car, while not literally being a car).

Likewise, blue tokens are the image of red tokens.

link