| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by AlexCoventry 419 days ago

DeepSeek claims they had 12% more Chinese tokens than English, in their training corpus for DeepSeek V2, FWIW.

> Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.