|
> former Dean of Electronics Engineering and Computer Science at Peking University, has noted that Chinese data makes up only 1.3 percent of global large-model datasets (The Paper, March 24). Reflecting these concerns, the Ministry of State Security (MSS) has issued a stark warning that “poisoned data” (数据投毒) could “mislead public opinion” (误导社会舆论) (Sina Finance, August 5).

From a technical point of view, I don't think it's actually the problem he suggests. You can use all the pro-democracy, pro-free-speech, anti-PRC data in the world, but the pretraining stage (on the planet's data) is mostly for instilling core language abilities, and is far less important than the SFT / RL / DPO / etc. stages, which require far less data and can tune a model towards whatever ideology you'd like. Plus, you can selectively identify vectors that encode certain high-level concepts and amplify them during inference, like Golden Gate Claude. |
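To make the "identify a concept vector and emphasize it at inference" idea concrete, here is a toy sketch on synthetic activations. It uses a simple difference-of-means steering vector, which is a simpler stand-in and, as I understand it, not the sparse-autoencoder feature approach behind Golden Gate Claude. The names `concept_direction`, `steer`, and `alpha` are made up for illustration, and the "activations" are random numbers rather than anything taken from a real model.

```python
import numpy as np

def concept_direction(with_concept, without_concept):
    """Unit vector pointing from the mean 'without' activation to the mean 'with' activation."""
    diff = with_concept.mean(axis=0) - without_concept.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift every hidden-state row by alpha along the concept direction."""
    return hidden + alpha * direction

# Synthetic stand-ins for activations (assumption: no real model involved).
rng = np.random.default_rng(0)
dim = 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0  # pretend the concept lives along axis 0
without_concept = rng.normal(size=(200, dim))
with_concept = rng.normal(size=(200, dim)) + 3.0 * true_dir

direction = concept_direction(with_concept, without_concept)

# "Inference": push new hidden states along the concept direction.
hidden = rng.normal(size=(4, dim))
steered = steer(hidden, direction, alpha=5.0)

proj_before = hidden @ direction
proj_after = steered @ direction
print(proj_after - proj_before)  # each entry is shifted by alpha
```

Because `direction` is normalized, each hidden state's projection onto the concept moves by exactly `alpha`; in a real model you would apply the shift inside a chosen layer's residual stream and tune `alpha` by hand.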
My personal opinion is that the PRC will face a self-created headwind that will likely, structurally, prevent them from leading in AI.
As models get more powerful, you can't simply train them on your narrative if it doesn't align with real-world data.
At some level of capability, the model will notice the mismatch, and then it becomes a can of worms.
This means they need to train the model to be purposefully duplicitous, which I predict will make it less useful and less capable, at least in most of the capacities in which we would want to use it.
Ironically, it also makes the model more of a threat and harder to control, so it will likely face resistance from party leadership as capability grows.
I just don't see them winning the race to high-intelligence models.