by szopa 1107 days ago

Tokenizers seem to be a massive pain in the neck if you are just calling into an API to use your model. The algorithm itself is non-trivial, and they need pretty sizable data to function: the vocabulary and the merges, which just sit there, using memory. I'm writing https://github.com/ryszard/agency in Go, and while there's a good library for the OpenAI tokenization, if you want a tokenizer for the HF models the best I found was a library calling HF's Rust implementation, which makes it horrible for distribution.

However, at some point I realized that I didn't really need the tokens, just the token count, as my most important use was implementing a Token Buffer Memory (trim messages from the beginning so that you never exceed the context size in tokens). And in order to do that I don't need it to be exactly right, just mostly right, if I am ok with slightly suboptimal efficiency (keeping slightly less tokens than the model supports). So, I took files from Project Gutenberg, and compared the ratio of tokens I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).
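
A minimal sketch of that estimate-and-trim idea, in case it helps to see it spelled out. The function names, the map keys, and the plain `[]string` message type are assumptions made for illustration, not code from the linked repository; only the 1.55 and 1.7 ratios come from the paragraph above.

```go
package main

import (
	"math"
	"strings"
)

// tokensPerWord holds the ratios quoted above. The keys are made-up labels.
var tokensPerWord = map[string]float64{
	"openai": 1.55,
	"claude": 1.7,
}

// EstimateTokens approximates a token count as ceil(pieces * ratio), where
// pieces is the number of strings returned by splitting on single spaces.
// Note that strings.Split counts empty strings between consecutive spaces,
// which nudges the estimate upward.
func EstimateTokens(text string, ratio float64) int {
	if text == "" {
		return 0
	}
	pieces := len(strings.Split(text, " "))
	return int(math.Ceil(float64(pieces) * ratio))
}

// TrimToBudget drops messages from the beginning until the estimated total
// fits within budget. It returns a sub-slice of the input, oldest first.
func TrimToBudget(messages []string, ratio float64, budget int) []string {
	total := 0
	for _, m := range messages {
		total += EstimateTokens(m, ratio)
	}
	start := 0
	for start < len(messages) && total > budget {
		total -= EstimateTokens(messages[start], ratio)
		start++
	}
	return messages[start:]
}
```

For example, `EstimateTokens("one two three", tokensPerWord["openai"])` gives 5 (3 pieces times 1.55, rounded up), and `TrimToBudget` with a budget of 6 on the messages "one two three", "four five", "six" keeps only the last two.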

I'm not throwing shade at this project – just being able to call the tokenizer would've saved me a lot of time. But I hope that if I'm wrong about the estimates being good enough, some good person will point out the error of my ways :)

2 comments

belladoreai 1107 days ago

> if I am ok with slightly suboptimal efficiency (keeping slightly less tokens than the model supports) ... multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude

This sounds reasonable to me. You might also want to consider estimates based on the number of characters. And you also need a fallback for when the user enters some weird input that doesn't fall inside your safety margin and instead causes the OpenAI API to return an error (maybe in that case you aggressively trim the input and retry?)


hospitalJail 1107 days ago

> I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).

One time I suggested this, got downvoted to hell.

To be fair to the downvoters, I quoted OpenAI's 7 tokens per word (on their tutorial page).

Seems incredibly unrealistic in hindsight, but at the time, things were fresh. Also, I think most people wanted something more robust than a linear calculation.
