Hi folks – I work at OpenAI and helped build this page, awesome to see it on here!
Heads up that it's a bit out of date, as GPT-4 has a different tokenizer than GPT-3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app).
I wasn't aware that GPT-3 and GPT-4 use different tokenizers. I'd read https://github.com/openai/openai-cookbook/blob/main/examples... and misinterpreted "ChatGPT models like gpt-3.5-turbo and gpt-4 use tokens in the same way as older completions models, ..." as meaning that GPT-3 and GPT-4 use the same tokenizer except for the im_ tokens. Now I can see so many improvements, including the encoding of whitespace and digits.
Hey, it seems that UTF-8 support is broken on the page.
A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is beautiful and amazing" in Russian).
I'm assuming it's the implementation on the page that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I guess wouldn't be the case if it were tokenized the way the page shows.
Author here, you are correct! The issue was that a single user-perceived character might span multiple tokens. This should be fixed now.
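For anyone curious what that looks like in practice, here is a minimal Python sketch of the idea, not the page's actual code. It assumes a byte-level tokenizer that hands back each token as raw UTF-8 bytes; the function name `group_tokens_by_character` and the way "Ж" is split across two tokens below are hypothetical, chosen only for illustration.

```python
def group_tokens_by_character(token_bytes):
    """Merge consecutive byte-level tokens until each group is valid UTF-8.

    token_bytes: list of bytes objects, one per token (assumed input shape).
    Returns a list of decoded strings, one per displayable group.
    """
    groups = []
    pending = b""
    for tok in token_bytes:
        pending += tok
        try:
            text = pending.decode("utf-8")
        except UnicodeDecodeError:
            # Incomplete multi-byte sequence so far: wait for the next token.
            continue
        groups.append(text)
        pending = b""
    if pending:
        # Leftover bytes that never completed; show a replacement character.
        groups.append(pending.decode("utf-8", errors="replace"))
    return groups


# Hypothetical split: "Ж" (bytes D0 96) arrives as two one-byte tokens,
# while "и" (bytes D0 B8) arrives whole.
tokens = [b"\xd0", b"\x96", b"\xd0\xb8"]
print(group_tokens_by_character(tokens))

# Decoding the first token on its own yields U+FFFD, the kind of garbage
# a per-token display would show.
print(repr(tokens[0].decode("utf-8", errors="replace")))
```

Decoding token by token is what produces the broken output; buffering bytes until they form a complete sequence avoids it, at the cost of showing several tokens as one visual group.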
Are there plans to release tokenisers for other platforms? I'm accessing the OpenAI API from Clojure, and it would be really nice to have a JVM version so I can estimate token use before sending.
That is very helpful, thank you. I had not realised the latest models were now tokenizing numbers as 3-digit groups. Can you give any insight into why 3 digits?
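For anyone else following along, here is a rough Python sketch of what left-to-right 3-digit grouping looks like. It is only an approximation under my own assumptions: the pattern `\d{1,3}` and the helper name `split_digits` are illustrative, it covers just the digit-splitting step before any merges, and it says nothing about why 3 was chosen.

```python
import re

# Assumed approximation of the digit rule: take runs of 1 to 3 digits,
# scanning left to right.
DIGIT_GROUPS = re.compile(r"\d{1,3}")


def split_digits(text):
    """Return the digit chunks found in text, each at most 3 digits long."""
    return DIGIT_GROUPS.findall(text)


print(split_digits("1234567"))  # ['123', '456', '7']
print(split_digits("2023"))     # ['202', '3']
```

Note that the grouping starts from the left, so the chunks do not line up with thousands separators: 1234567 becomes 123, 456, 7 rather than 1, 234, 567.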