| HN Mirror

	by yyhhsj0521 419 days ago
	The Chinese internet mostly consists of a few walled gardens tightly controlled by big corporations. Crawlers simply don't work when each company employs an army of engineers to guard its data. Many of the most popular websites are also app-only. It's impossible to get the corpus necessary to train a good LLM.

2 comments

AlexCoventry 419 days ago

DeepSeek claims they had 12% more Chinese tokens than English ones in their training corpus for DeepSeek V2, FWIW.

https://arxiv.org/pdf/2405.04434#page=12

> Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

bredren 419 days ago

Do we have estimates on the corpus that is available? This model's repo describes "multiple strategies to generate massive diverse synthetic reasoning data." FWIW, AI 2027 forecasts heavy emphasis on synthetic data creation.

Is the lack of an existing corpus just an extra hurdle for Hanzi-first models that are also leading the pack in benchmarks?