| HN Mirror



	by simonw 764 days ago
	Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token lengths? With previous tokenizers there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
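	One way to measure that penalty yourself is sketched below. It is a minimal sketch rather than anything from the linked post: `token_penalty` and `tiktoken_encoder` are hypothetical helper names, and it assumes you have the `tiktoken` package installed and want to compare its `cl100k_base` encoding (used by GPT-4) against `o200k_base` (used by GPT-4o).

```python
from typing import Callable, Sequence

# An encoder is anything that turns a string into a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def token_penalty(text: str, reference: str, encode: Encoder) -> float:
    """Return how many times more tokens `text` needs than `reference`.

    `reference` would typically be the English translation of `text`.
    """
    ref_tokens = len(encode(reference))
    if ref_tokens == 0:
        raise ValueError("reference text produced no tokens")
    return len(encode(text)) / ref_tokens


def tiktoken_encoder(name: str) -> Encoder:
    """Build an encoder from a tiktoken encoding name (assumes tiktoken is installed)."""
    # Imported lazily so token_penalty stays usable without tiktoken.
    import tiktoken

    return tiktoken.get_encoding(name).encode


# Example usage (requires tiktoken and its encoding files):
#   old = tiktoken_encoder("cl100k_base")
#   new = tiktoken_encoder("o200k_base")
#   print(token_penalty(japanese_text, english_text, old))
#   print(token_penalty(japanese_text, english_text, new))
```

	A penalty close to 1.0 means the language costs about the same as the English reference; the exact numbers depend heavily on the sample text you pick.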


tedsanders 764 days ago

Yep. Non-English text gets a much bigger cost drop and speedup than English does. It has always been a bummer that GPT-4 is something like 5x slower and more expensive in Japanese and similar languages.


simonw 764 days ago

Just found there's a whole section about that in this post: https://openai.com/index/hello-gpt-4o/

It says "Japanese 1.4x fewer tokens (from 37 to 26)". Some other languages get much bigger improvements, though; the best is "Gujarati 4.4x fewer tokens (from 145 to 33)".
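For anyone checking the arithmetic, the quoted multipliers are just the old token count divided by the new one. A quick sketch using only the two pairs of counts quoted above (`reduction_factor` and `fraction_saved` are hypothetical helper names):

```python
def reduction_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times fewer tokens the new tokenizer needs."""
    return old_tokens / new_tokens


def fraction_saved(old_tokens: int, new_tokens: int) -> float:
    """Share of the original tokens that are no longer needed."""
    return (old_tokens - new_tokens) / old_tokens


# (old, new) token counts as quoted from the GPT-4o announcement.
counts = {"Japanese": (37, 26), "Gujarati": (145, 33)}

for language, (old, new) in counts.items():
    factor = reduction_factor(old, new)
    saved = fraction_saved(old, new)
    print(f"{language}: {old} -> {new} tokens, {factor:.1f}x fewer ({saved:.0%} saved)")
```

So "1.4x fewer" for Japanese is roughly a 30% cut in tokens, while "4.4x fewer" for Gujarati is roughly a 77% cut.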
