by yyhhsj0521 419 days ago
The Chinese internet mostly consists of a few walled gardens tightly controlled by big corps. Crawlers simply don't work when each company employs an army of engineers to guard its data. Many of the most popular websites are also app-only. It's impossible to get the corpus necessary to train a good LLM.

FWIW, DeepSeek claims the training corpus for DeepSeek V2 had 12% more Chinese tokens than English ones.

https://arxiv.org/pdf/2405.04434#page=12

> Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

Do we have estimates of how large the available corpus is? This model's repo describes "multiple strategies to generate massive diverse synthetic reasoning data." FWIW, AI 2027 forecasts a heavy emphasis on synthetic data creation.

Is the lack of an existing corpus just an extra hurdle for Hanzi-first models that are also leading the pack in benchmarks?