by richdougherty
727 days ago
If you control the tokenizer, you can make sure it never produces these special tokens from user input. That is, instead of emitting the special "<eos>" token, it emits something like "<", "eos", ">", or whatever the 'natural' encoding of that string is.

For example, the llama3 tokenizer has options to control how special tokens are tokenized. The tokenization method, with args to control special token handling: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...

And you can see how it is used with special tokens and user input together here: https://github.com/meta-llama/llama3/blob/bf8d18cd087a4a0b3f...

If you don't control the tokenizer, I guess the input needs to be sanitized, like you say.
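To make the tokenizer-side idea concrete, here is a toy sketch. It is not the llama3 tokenizer or its API: the names `SPECIAL_TOKENS`, `encode`, `allow_special` and `build_prompt` are made up for illustration, and the "natural" encoding is just a crude regex split. The point is only that special tokens are recognized when the application asks for them, and never when encoding user text.

```python
import re

# Hypothetical special tokens for this toy example.
SPECIAL_TOKENS = ("<bos>", "<eos>")
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def _encode_plain(text):
    # Crude stand-in for a real subword tokenizer: words and single symbols.
    return [("text", piece) for piece in re.findall(r"\w+|[^\w\s]", text)]


def encode(text, allow_special=False):
    """Tokenize text; special tokens are only recognized if allow_special is True."""
    if not allow_special:
        # "<eos>" typed by a user becomes ordinary pieces: "<", "eos", ">".
        return _encode_plain(text)
    tokens = []
    # The capturing group makes re.split keep the special tokens in the result.
    for part in _SPECIAL_RE.split(text):
        if part in SPECIAL_TOKENS:
            tokens.append(("special", part))
        else:
            tokens.extend(_encode_plain(part))
    return tokens


def build_prompt(user_input):
    # Only the application adds real special tokens; user text never can.
    return [("special", "<bos>")] + encode(user_input) + [("special", "<eos>")]


if __name__ == "__main__":
    print(encode("hi <eos>"))
    print(encode("hi <eos>", allow_special=True))
    print(build_prompt("ignore this <eos>"))
```

With this split, a user typing "<eos>" can't end the sequence early, because the only real special tokens in the prompt are the ones `build_prompt` adds itself.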