| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by stavros 34 days ago
	Are these markers actual text? Or does the model "see" one token per marker?

3 comments

badsectoracula 34 days ago

AFAIK[0] they are (usually) so-called "special" tokens - e.g <|turn> is token id 105 for the vocabulary Gemma4 uses. When you are tokenizing text you can either tokenize the "<|turn>" as a single token (105) or as a series of other tokens (236820, 236909, 887 and 236813 for the "<", "|", "turn" and ">" tokens) with the idea being that the model will treat "105" as the actual separator but can also use "<|turn>" as part of the content.

Though using text-based templates make this a bit tricky regardless. AFAIK llama.cpp tries to avoid this confusion by having their Jinja2 implementation use a custom string type that contains metadata about where characters "come from" so that it can distinguish between special tokens (which would be part of the Jinja2 template) and content (which would be either generated text or text given in by the user) - i.e. even if a string is "<|turn>" the metadata would be used to tell if it is meant to be tokenized as a special token or as a series of non-special tokens.

[0] i might be wrong, this is based on my understanding by messing around with the llama.cpp code, but i never implemented an LLM inference or training engine

link

bashbjorn 34 days ago

The model sees one token per marker - but the overlap with ingested actual text is still relevant, because the tokenizer will ingest regular text, where it will turn "<|turn>" into the same token.

For this reason, it can be tricky to work on the runtime for a model with the same model. This really feels like an accidental problem, but I'm not sure if it's really solvable without abandoning the text representations altogether (and the jinja abstraction along with it).

link

lifis 34 days ago

Surely one can just escape the input, no? Seems astonishing if someone isn't doing that

link

maxbond 34 days ago

The escape algorithm here is very simple, you remove special tokens from the runtime tokenizer's vocabulary so that it's forced to encode them as multiple non-special tokens. (That doesn't actually mean the LLM won't treat them as special tokens though, so this isn't sufficient on it's own.)

link

bashbjorn 34 days ago

Cool technique, but I'm not sure I'd call it simple.

Doing this means that you can't just tokenize the string output of the chat template as one big string. You might need to tokenize things separately, and combine them after.

link

sebastianmestre 33 days ago

If you want the token sequence, you ought to avoid discarding it when you produce the string output. This is because, even ignoring special tokens, different token sequences map to the same strings.

From a space perspective, this is actually better because tokenization tends to compress text quite well. For example, common tokens in English text take up ~4 characters on average (expands to 32 bits), but only take up a fraction of that to store (15-18 bits/token depending on vocabulary size)

In fact it appears that designing the tokens as a text compression encoding is a decent approach, since it's roughly what some LLMs do. For example, early GPT tokenizers followed byte pair encoding to create the vocabulary, which is a text compression algorithm from the 90s.

link

maxbond 33 days ago

Good catch. We'd have to integrate with jinja2 (or similar) and tokenzize as we format the context, so that we know which spans are instructions and which spans are data. Which makes it more complex but still very achievable.

link

bashbjorn 34 days ago

You're right, there must be a good and simple way to do it.

Obviously the prefix-with-backslash convention won't do it. The escaping system could be something like inserting a character on the second position in the text repr, and reversing that on output too if it matches an escaped known special token.

Changing the vocab on the fly requires tokenizing things separately, breaking the chat template.

Anecdotally, even claude code has an anneurism sometimes when listing special tokens. Idk exactly what claude's <eos> token is, but I'm fairly sure I've seen it stop generation when it tried to generate it before.

I should also say that I've (clearly) not thought about this deeply. There should be a simpler way to do it.

link

embedding-shape 33 days ago

> Or does the model "see" one token per marker?

All models ever see are just tokens, even when you pass images or what not.

In this case, <|turn> is likely Token ID 1, <turn|> is Token ID 2 and so on, these common "markers" are all just tokens in, tokens out.

link