Hacker News new | ask | show | jobs
by matthewolfe 352 days ago
The output should be identical, assuming no bugs.

The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.

Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.

TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.

[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476

[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...

1 comments

Isn't this incorrect? If the user doesn't specify what to do with almost all of the special tokens, you still must detect them so you can raise an error.