| HN Mirror



	by atgctg 809 days ago
	Tiktoken added support for GPT-4o: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74c... It has an increased vocab size of 200k.

5 comments

mike_hearn 809 days ago

Does that imply they retrained the foundation model from scratch? I thought changing the tokenization was something you couldn't really retrofit to an existing model. I mean sure they might have initialized the weights from the prior GPT-4 model but it'd still require a lot of retraining.


famouswaffles 809 days ago

Yeah and they say as much in the blog.


minimaxir 809 days ago

For posterity, the GPT-3.5/4 tokenizer had a vocabulary of about 100k tokens. The benefit of a larger vocabulary is more efficient tokenization (and therefore cheaper/faster inference), but with steep diminishing returns: a larger vocabulary makes the model more difficult to train, while tending to reduce token usage by 10-15%.
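To put the 10-15% figure in concrete terms, here is a minimal sketch. The baseline of one million tokens is a made-up number, and the idea that cost scales linearly with token count is an assumption for illustration, not something stated in the thread.

```python
def tokens_after_reduction(tokens_before: int, reduction: float) -> int:
    """Token count left after shrinking usage by the given fraction."""
    return round(tokens_before * (1 - reduction))


# Hypothetical workload size, chosen only to make the arithmetic readable.
baseline_tokens = 1_000_000

for reduction in (0.10, 0.15):
    remaining = tokens_after_reduction(baseline_tokens, reduction)
    # Assumption: cost and latency are roughly proportional to token count.
    print(f"{reduction:.0%} reduction: {remaining:,} tokens remain")
```

Under that linear-cost assumption, the saving is the same 10-15% regardless of how large the workload is.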


simonw 809 days ago

Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token lengths?

With previous tokenizers there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/


tedsanders 809 days ago

Yep. Non-English text gets a much bigger cost drop and speedup compared to English. Has always been a bummer that GPT-4 is like 5x slower and more expensive in Japanese, etc.


simonw 809 days ago

Just found there's a whole section about that in this post: https://openai.com/index/hello-gpt-4o/

It says "Japanese 1.4x fewer tokens (from 37 to 26)" - some other languages get much bigger improvements though, best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
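The "x fewer tokens" figures are just the ratio of the old count to the new one. A quick sketch to check the two examples quoted above (the token counts come from the quoted post; the helper name `reduction_factor` is mine):

```python
# Token counts quoted from the post: (old tokenizer, new tokenizer).
quoted_counts = {
    "Japanese": (37, 26),
    "Gujarati": (145, 33),
}


def reduction_factor(before: int, after: int) -> float:
    """How many times fewer tokens the new tokenizer needs."""
    return before / after


for language, (before, after) in quoted_counts.items():
    factor = reduction_factor(before, after)
    saved = 1 - after / before  # fraction of tokens no longer needed
    print(f"{language}: {factor:.1f}x fewer tokens, {saved:.0%} saved")
```

This prints 1.4x (about 30% saved) for Japanese and 4.4x (about 77% saved) for Gujarati, matching the quoted factors.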


kristofferR 809 days ago

How are they able to use such a brand name, Tiktoken? Is it because TikTok is Chinese? Tiktoken, it's almost like if Apple released the Facebooken library for something entirely unrelated to Facebook.


gemeral 808 days ago

That's not the right analogy. The "tok" in "Tiktoken" comes from "token", not "TikTok".


meiraleal 804 days ago

And the "tik" comes from TikTok.


moffkalast 809 days ago

Lots of those tokens would have to be pixel patches and sound samples right?


nojvek 809 days ago

Yep. Since it’s multimodal. Pictures, text, audio all go into token space.
