| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by maxbond 37 days ago
	The escape algorithm here is very simple, you remove special tokens from the runtime tokenizer's vocabulary so that it's forced to encode them as multiple non-special tokens. (That doesn't actually mean the LLM won't treat them as special tokens though, so this isn't sufficient on it's own.)

1 comments

bashbjorn 37 days ago

Cool technique, but I'm not sure I'd call it simple.

Doing this means that you can't just tokenize the string output of the chat template as one big string. You might need to tokenize things separately, and combine them after.

link

sebastianmestre 37 days ago

If you want the token sequence, you ought to avoid discarding it when you produce the string output. This is because, even ignoring special tokens, different token sequences map to the same strings.

From a space perspective, this is actually better because tokenization tends to compress text quite well. For example, common tokens in English text take up ~4 characters on average (expands to 32 bits), but only take up a fraction of that to store (15-18 bits/token depending on vocabulary size)

In fact it appears that designing the tokens as a text compression encoding is a decent approach, since it's roughly what some LLMs do. For example, early GPT tokenizers followed byte pair encoding to create the vocabulary, which is a text compression algorithm from the 90s.

link

maxbond 36 days ago

Good catch. We'd have to integrate with jinja2 (or similar) and tokenzize as we format the context, so that we know which spans are instructions and which spans are data. Which makes it more complex but still very achievable.

link