|
|
|
|
|
by tekne
5 days ago
|
|
I mean: imagine we double our token space to get "red" tokens ans "blue" tokens. Then in all post-training, instructions are red and data is blue. The model can be explicitly trained to ignore instructions written in blue tokens. All external data is blue. All you'd need to do is figure out a nice way to pre-train -- interestingly, you could try pre-training on unfiltered blue data and processed red/blue transcripts! Likewise, model-actions (e.g. open file) could be written only in red, and hence you'd never learn to do them from the unfiltered data. The only connection between the red world and the blue world would be the processed trainign chats containing red and blue data togethers -- allowing the model to learn the relationship between them (while only being exposed to examples where red instructions are strictly followed, whatever the blue says) |
|