| HN Mirror

	by dschnurr 1176 days ago
	Hi folks – I work at OpenAI and helped build this page, awesome to see it on here! Heads up that it's a bit out of date, as GPT-4 has a different tokenizer than GPT-3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app).
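
For anyone who wants to try the tiktoken route mentioned here, a minimal sketch of counting tokens under a chosen encoding follows. It assumes the `tiktoken` package is installed; `r50k_base` and `cl100k_base` are tiktoken's names for the encodings used by the original GPT-3 models and by gpt-4 respectively. The helper `count_tokens` and its `encoding` parameter are hypothetical names made up for illustration and are not part of tiktoken.

```python
def count_tokens(text, encoding_name="cl100k_base", encoding=None):
    """Return the number of tokens in `text`.

    `encoding` is a hypothetical injection point: any object with an
    `encode(str) -> list` method. When it is omitted, the named tiktoken
    encoding is loaded instead (requires `pip install tiktoken`; the first
    load fetches the vocabulary file).
    """
    if encoding is None:
        import tiktoken  # third-party package, imported lazily

        encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))
```

Calling `count_tokens(sample, encoding_name="r50k_base")` and then `count_tokens(sample, encoding_name="cl100k_base")` on the same string is a quick way to see how far the two tokenizers diverge on whitespace and digits.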

5 comments

lowefk 1176 days ago

I wasn't aware that GPT-3 and GPT-4 use different tokenizers. I'd read https://github.com/openai/openai-cookbook/blob/main/examples... and misinterpreted "ChatGPT models like gpt-3.5-turbo and gpt-4 use tokens in the same way as older completions models, ..." as meaning that GPT-3 and GPT-4 use the same tokenizer except for the im_ tokens. Now I can see so many improvements, including the encoding of whitespace and digits.

egorfine 1175 days ago

Hey, it seems that UTF-8 support is broken on the page.

A test phrase could be something like "Жизнь прекрасна и удивительна" ("Life is great" in Russian).

I'm assuming it's the page's implementation that is broken, not the actual tokenizer. The reason: Russian works perfectly in GPT-3, which I guess wouldn't be the case if the tokenization were really as presented on the page.

dqbd 1172 days ago

Author here, you are correct! The issue here is due to the fact that a single user-perceived character might span into multiple tokens. This should be fixed now.
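
To illustrate the point about one character spanning multiple tokens, here is a small stdlib-only sketch. It assumes a byte-level tokenizer (tokens are byte sequences over the UTF-8 encoding of the text, as in GPT-2-style BPE). The token boundaries below are invented for illustration and are not real tokenizer output; `decode_each` and `decode_joined` are hypothetical helper names.

```python
text = "Жизнь"
data = text.encode("utf-8")  # 5 Cyrillic letters, 2 bytes each -> 10 bytes

# Hypothetical token boundaries (NOT real tokenizer output): the second
# cut falls in the middle of a 2-byte letter.
hypothetical_tokens = [data[:2], data[2:5], data[5:]]


def decode_each(tokens):
    """Decode every token on its own, as a naive visualiser might."""
    return [t.decode("utf-8", errors="replace") for t in tokens]


def decode_joined(tokens):
    """Concatenate the bytes first, then decode once."""
    return b"".join(tokens).decode("utf-8")


per_token = decode_each(hypothetical_tokens)
joined = decode_joined(hypothetical_tokens)

print(per_token)  # tokens 2 and 3 each contain U+FFFD (replacement char)
print(joined)     # the original text, intact
```

Decoding token by token produces replacement characters wherever a boundary splits a multi-byte character, while decoding the concatenated bytes gives back the original string. That would be consistent with a display problem on the page rather than a problem in the tokenizer itself.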

egorfine 1171 days ago

Hey, thank you! Has the fix not been deployed yet, though? It still shows broken UTF-8.

> a single user-perceived character might span into multiple tokens

Is this how it works by design, or is this a bug?

lemming 1176 days ago

Are there plans to release tokenisers for other platforms? I'm accessing the OpenAI API from Clojure, and it would be really nice to have a JVM version so I can estimate token use before sending.

teruakohatu 1176 days ago

That is very helpful, thank you. I had not realised the latest models were now tokenizing numbers in 3-digit groups. Can you give any insight into why 3 digits?
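
The 3-digit grouping can be sketched with a regex. The pre-tokenization pattern that tiktoken ships for `cl100k_base` includes the alternative `\p{N}{1,3}`, so a run of digits is cut into chunks of at most three, starting from the left, before any BPE merging happens. The snippet below isolates only that one rule and uses `\d` from the standard `re` module as a stand-in for `\p{N}`; `split_digits` is a hypothetical helper name, and this shows what the grouping does, not why it was chosen.

```python
import re

# Only the digit rule of the pre-tokenizer, approximated with \d.
DIGIT_CHUNK = re.compile(r"\d{1,3}")


def split_digits(number_text):
    """Split a run of digits into left-aligned chunks of up to 3 digits."""
    return DIGIT_CHUNK.findall(number_text)


print(split_digits("1234567"))  # ['123', '456', '7']
```

Note that the chunks are left-aligned, so they do not line up with thousands separators: 1234567 becomes 123, 456, 7 rather than 1, 234, 567.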

resters 1176 days ago

Was the purpose of the page and post to generate comments that can be used as training data?
