

by mschuster91 5 days ago

Kimi and Qwen come out of China, which means that their training material may be biased, e.g. on topics relating to Taiwan [1]. In addition, there is no way to determine what input went into the training, whether it was properly licensed, whether it was legal (e.g. not contaminated by CSAM), or how the human component of RLHF was sourced - for US models, for example, stories about exploitation like [2] have been floating around for years.

Assuming we Europeans finally get our act together, I think it is better for our long-term future (and for the ethical problems) if we manage to build a baseline of training input and data ourselves, from scratch, with everything being ethically sourced.

Oh and, while we're at it, the EU has 24 official languages plus a host of minority languages. Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best. A European model with actual funding and proper data sources might be able to significantly reduce that gap.

[1] https://www.taiwannews.com.tw/news/6245677

[2] https://www.theguardian.com/technology/2024/apr/16/techscape...

7 comments

altmanaltman 4 days ago

It really doesn't matter if the model sucks and doesn't perform well. Given the funding amount and their lofty ambitions, it seems very unlikely they will be able to pull it off properly.

Yeah, Chinese and US models have biases, but so will any model. The biases do not get in the way of the product though. You don't open those models just to ask what happened in Tiananmen Square or whether Taiwan is a state. You don't ask ChatGPT to generate CSAM. But they are very good at the tasks you actually expect from an LLM. If you fail at that, nobody will use your model no matter how "ethically sourced" a colonizer-based entity like Europe made it.

edg5000 4 days ago

> no matter how "ethically sourced" a colonizer-based entity like Europe made it

The attempt is laughable, but every country should at least try to keep up with frontier technology, even if they fail massively or are massively underfunded.

On the other hand, it's arguably wasteful for an incompetent govt to do something like this, since the money will almost certainly not be well spent. It will just go to people who are good with MS Word. That's the likely failure mode for such NL innovation projects. The actual solution is a culture shift, but that is much harder, if not impossible, to pull off and requires decades. But we (NL people and govt) should work towards that. Most likely all these govt-led innovation attempts are a sad waste of tax money.

bigfudge 4 days ago

The culture shift that has generated this is the same one that causes the other story on HN this morning about xAI's gas generators being a national security issue. I.e. one towards corruption, graft and the public ill.

I don’t want Europe to model itself on the US, whatever the economic gain. Hopefully we are large enough to find a third way between China and the US.

edg5000 2 days ago

You're right, it's not good to idolize this success because it does come with strings attached. You can't have everything. It makes sense to define European success in its own frame of reference.

We do still have to be strategic on the world stage though, or we'll be humiliated and pay dearly.

vintermann 4 days ago

The Chinese models are almost certainly taught to comply with "Chinese values" in the RLHF step, not from filtering the training data. There may be a few things which are too radioactive to be allowed even in the training material - but that's more likely to be things like child abuse images for a visual model, things non-Chinese values also have an issue with.

I'm pretty sure no country taking a stab at making their own model for sovereignty purposes will let "proper licensing" stand in their way.

gnerd00 5 days ago

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best.

That is not true, so please read before forming an opinion. The French Mistral project shipped seven+ years ago with 140 languages, for example. Language translation was the first LLM task, from 2015.

selcuka 5 days ago

One example is not the same as "most LLMs". My experience is the same with most LLMs. Especially the smaller ones are English oriented (probably makes sense given the size constraints).

jampekka 4 days ago

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best

Current frontier models (closed and open) are already really good at small languages too. I use them in Finnish sometimes, and the language is immaculate. They understand even somewhat obscure dialects. Multilinguality seems to be a mostly solved problem.

KronisLV 4 days ago

This already exists https://eurollm.io/

How do people not know about it and keep making stuff from scratch?

Alexander-Barth 4 days ago

I did not know about EuroLLM. I had a look at the paper (https://arxiv.org/abs/2602.05879) describing it:

> Specifically, we discard documents shorter than 200 characters (Xue et al., 2021a), and any page containing the phrase “lorem ipsum,” the word “javascript,” or curly brackets (Raffel et al., 2023)....

It is quite surprising/funny to see all documents with javascript removed.
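
To make the quoted rule concrete, here is a minimal Python sketch of such a filter. It is not EuroLLM's actual code: the function name `keep_document` is made up for illustration, and the case-insensitive, whole-word matching of "javascript" is an assumption, since the quote does not say how the match is done.

```python
import re

# Length threshold taken from the quoted passage.
MIN_CHARS = 200


def keep_document(text: str) -> bool:
    """Return True if the text survives the heuristics in the quote."""
    # Rule 1: discard documents shorter than 200 characters.
    if len(text) < MIN_CHARS:
        return False
    lowered = text.lower()
    # Rule 2: discard pages containing the phrase "lorem ipsum".
    if "lorem ipsum" in lowered:
        return False
    # Rule 3: discard pages containing the word "javascript".
    # Assumption: matched as a whole word, ignoring case.
    if re.search(r"\bjavascript\b", lowered):
        return False
    # Rule 4: discard pages containing curly brackets.
    if "{" in text or "}" in text:
        return False
    return True


prose = "The quick brown fox jumps over the lazy dog. " * 10
print(keep_document(prose))
print(keep_document(prose + "Please enable JavaScript to view this page."))
```

Under these assumptions, ordinary prose passes, while a page with an "enable JavaScript" banner, or any page quoting code with braces, is dropped. That is the side effect the comment above finds funny: a lot of programming-related text goes out along with the boilerplate.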

dr_dshiv 5 days ago

There is an OCR error rate of something north of 8%... that will hurt model quality!

siva7 5 days ago

Uh, some would say it's easy to determine what input went into the training for Kimi and Qwen... since they were caught stealing it from American labs. Some cultural cliches may never change.

ignoramous 5 days ago

> since they were caught stealing it from American labs. Some cultural cliches may never change.

Has a formal lawsuit been brought? Given that Anthropic & OpenAI are being dragged through the courts for copyright violation (or stealing, as you'd call it if the companies involved were culturally Chinese) by newspapers, publishing houses etc., one would think they'd pass on some of that medicine to Alibaba, which does have business entities registered in the US.

janc_ 5 days ago

It's well-known that all commercial models are based on stolen content. That doesn't mean there is no filtering/censoring, just that the censoring likely depends on where it's happening…

selcuka 5 days ago

> It's well-known that all commercial models are based on stolen content.

Does that mean that Chinese models are the "Robin Hood"s of the AI era?

basisword 4 days ago

>> Some cultural cliches may never change.

Let’s just gloss over the monstrous amount of copyrighted and pirated material the American labs trained on. China bad. American good. Some cultural cliches never change.

mschuster91 4 days ago

How about, both China and the US bad, Europe at least somewhat decent because we lack the financial incentives to behave like utter arseholes?

kouteiheika 4 days ago

> since they were caught stealing it from American labs

...and the "good guys", the American labs, were caught stealing from authors all over the world[1].

[1]: www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material

j_french 4 days ago

> ... Anthropic began buying books in bulk, tearing off the bindings and scanning each page before feeding the digitized versions into its AI model, according to court documents.

Wow. This image of Anthropic employees ripping books apart to use them to train models is a powerful one; it seems like an inflection point in the history of information.