|
|
|
|
|
by AlexCoventry
419 days ago
|
|
DeepSeek claims they had 12% more Chinese tokens than English, in their training corpus for DeepSeek V2, FWIW. https://arxiv.org/pdf/2405.04434#page=12 > Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones. |
|