by dwa3592 4 days ago
I don't understand countries (especially governments) wanting to have their own models when there are already pretty solid open source (weights) models out there.

Countries should want control over _where_ the compute is happening rather than _what code_ is running.

What's wrong with a country hosting Kimi, Qwen or gpt-oss on its own hardware for government work?

6 comments

An LLM is an encoding of a culture, a way of viewing the world.

LLMs are not neutral technology; they directly reflect the training set that was chosen and how they were aligned.

In many ways, they are ideology made code.

If we leave building them to the US and China, only their way of seeing things will be digitized.

I don't like the idea of that.

Yes, and also, US and Chinese models are censored in different ways. US models are way too prudish for personal use in Europe because they're afraid to piss off religious investors. Chinese models are too censored on history and current affairs, e.g. the "Tiananmen massacre never happened" kind of stuff.
Chinese models aren't censored as much as you think. You can download the model and run it somewhere else, and it will happily tell you about Tiananmen Square. Or heck, ask DeepSeek via OpenRouter; it will do the same.

The censorship works kind of like with Fabel, it kicks in before the model responds.

There's an absolutely massive cultural and behavioural bias in those models. Models will suggest things like "go to the hospital" for things that require GP appointments, "just drive three hours" while it's faster to go places by train, and so on. They will do it in anglicised Dutch (compound words split, English-like grammar structures) that's perfectly understandable, but the cultural bias is there if you know to look for it.

Furthermore, the expertise in designing and training these models is valuable as well. The existing models are good as a starting point in terms of learning from previous mistakes, but we should not just let a handful of American and Chinese people keep the knowledge and expertise.

One problem with this particular project, though, is that copyright has been enforced against Dutch LLM training before, and the AI industry cannot exist without piracy on a massive scale, the likes of which has never been seen before. A lot of Dutch training material exists in pirated books that AI companies in countries that do not care about copyright have access to, but that material is excluded from the training set here. The impact of enforcing copyright on an AI model will be quite interesting to see.

It is not about the country but the language. Most LLMs have poor or no support for Dutch.
Idk which models you refer to, but I tested a bunch recently, and they performed well on Dutch. Only the smallest, such as qwen 3.6 27B, made up words and switched languages.
There's a large gap between making up words and an actually native text distribution. LLMs have a clear pattern, clear tells, a "feel" in English, and it's normally even more pronounced in non-English languages.

Lots of bias towards English sentence structure, idioms, etiquette, etc.

I didn't notice any of that. Such a bias would be strange, because certainly smaller models don't have the luxury of learning grammar independently: it's still word sequences, and languages are quite well separated.
There would be a bunch of value in having, say, a good 30B-class model that used my local language as well as it does English. There are lots of cases, especially in the government sphere, where local processing is a requirement and frontier-level capabilities aren't required. Making those cheap to run seems like a fine goal.
Can you provide some examples of these use cases?
Support bots and question answering with access to sensitive PII?
Yes, but what's the point of a support bot that writes good Dutch when it can't follow instructions, doesn't understand the questions or can't solve problems? I might be wrong, but I don't think atm these models have the cognitive ability to perform any task in a satisfactory manner.

As for accessing PII, I imagine the value here is in the fact that they're local, which has nothing to do with the "sovereignty" of these models. If anything, a model is more likely to be tricked by a malicious prompt the farther it is from the SOTA.

I don't understand this. Even if that were true (and it isn't in my experience), a model that is trained on a Dutch corpus and arguably "knows Dutch well" but has the reasoning and comprehension abilities of a three-year-old is useless in any case. I'd rather use a model that can only speak English and put an automatic translator around it.
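To make the "translator around it" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `wrap_with_translation`, the toy dictionary "translators" and `english_only_model` are stand-ins for a real translation system and a real English-only model, not any actual API.

```python
from typing import Callable

TextFn = Callable[[str], str]

def wrap_with_translation(model: TextFn, to_english: TextFn, from_english: TextFn) -> TextFn:
    """Return a function that translates the prompt in, runs the model, and translates the answer out."""
    def answer(prompt: str) -> str:
        english_prompt = to_english(prompt)
        english_answer = model(english_prompt)
        return from_english(english_answer)
    return answer

# Toy stand-ins (assumptions): a real setup would call an MT system and an LLM here.
NL_TO_EN = {"Wat is de hoofdstad van Nederland?": "What is the capital of the Netherlands?"}
EN_TO_NL = {"The capital is Amsterdam.": "De hoofdstad is Amsterdam."}

def toy_to_english(text: str) -> str:
    return NL_TO_EN.get(text, text)

def toy_from_english(text: str) -> str:
    return EN_TO_NL.get(text, text)

def english_only_model(prompt: str) -> str:
    if prompt == "What is the capital of the Netherlands?":
        return "The capital is Amsterdam."
    return "I don't know."

dutch_bot = wrap_with_translation(english_only_model, toy_to_english, toy_from_english)
reply = dutch_bot("Wat is de hoofdstad van Nederland?")
```

The catch raised elsewhere in the thread still applies: any cultural or idiomatic bias in the English model passes straight through the translation layer.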
To be fair, there is a security angle: even open-source models could be trained as sleeper agents that act adversarially (for example, adding backdoors) when used in specific national companies in specific settings. This is very difficult to detect or avoid, so if you want to be 100% sure that this isn't the case, you have to train your own model from scratch.
Why should Dutch people be expected to make do with models 99% trained on American/Chinese cultural context and language?
Maybe the Dutch really really want an LLM that tells them the truth as straight as possible no matter how harsh - that might be tricky
Understood, but they could fine-tune base models on their own cultural context and language. Why reinvent the wheel?
This gets better short-term results for a fraction of the cost, for sure, but what do you do when China places an export control banning the release of open-weight models? If you don't have your own talent, you're then relegated to using a base model from 2026 or whatever the cutoff date is, forever. That defeats the purpose of a 'sovereign' model made for and by your people.
They could apply the Polder Model of consensus decision making with a mixture of experts.

https://en.wikipedia.org/wiki/Polder_model

Funny, that's what I thought when PewDiePie set up his monster AI rig and what he called a 'council'. Quote:

"PewDiePie has built a custom web UI for self-hosting AI models called "ChatOS" that runs on his custom PC with 2x RTX 4000 Ada cards, along with 8x modded RTX 4090s with 48 GB of VRAM. Running open-source models from Baidu and OpenAI, PewDiePie made a "council" of bots that voted on the best responses, and then built "The Swarm" for data collection that will become the foundation of his own model coming next month."

https://www.tomshardware.com/tech-industry/artificial-intell...
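For anyone wondering what a "council" might boil down to, here is a rough sketch of the simplest possible version: a plain majority vote over the answers of several models. This is an assumption for illustration, not PewDiePie's actual setup (the quote describes bots voting on responses, which is more elaborate); `council_vote` and the three stub models are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

def council_vote(prompt: str, members: Sequence[Callable[[str], str]]) -> str:
    """Ask every member for an answer and return the most common one."""
    if not members:
        raise ValueError("need at least one council member")
    answers = [member(prompt) for member in members]
    # most_common orders ties by first occurrence, so the earliest member wins a tie.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Stub members (assumptions): replace with real model inference calls.
def model_a(prompt: str) -> str:
    return "Amsterdam"

def model_b(prompt: str) -> str:
    return "Amsterdam"

def model_c(prompt: str) -> str:
    return "The Hague"

result = council_vote("What is the capital of the Netherlands?", [model_a, model_b, model_c])
```

Exact-match voting like this only works when answers are short and comparable; free-form responses would need a judge or a scoring step instead.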

Calling a bunch of LLMs a "council" is just rebranding well known ensemble methods with shrill marketing hype, nothing original or out of the ordinary. Mixture of experts and every other idea in his stack has a literature older than PewDiePie's career.

Yet another attention craving influencer who shilled crypto scams during the crypto bubble and is now marketing "AI councils" during the AI bubble.

He's not a serious or honest person, AI is just what he pivoted to after crypto. That's not innovation; it's attaching trendy branding to ideas that were already old when Marvin Minsky wrote The Society of Mind in 1986, three years before PewDiePie was zero years old in 1989.

The only thing PewDiePie's brought to the table is cleverly optimized YouTube thumbnails designed to attract clicks. The architecture is decades old; only his marketing and shilling are state of the art.

Lighten up dude.
I thought fine-tuning data can't contradict foundation models, and anything that is inconsistent with the standard LLM American-Chinese split personality would be rejected?
Fine tuning happens on top of pretraining, so of course it can "forget" pretrained defaults when warranted by the new data it's being fine tuned on.
But you have to have more data than was used for pretraining for the added knowledge to take precedence over pretraining, no? If that were the case, you would practically be contradicting the knowledge in the base model.

I mean ... LLMs are sort of an extreme and living proof of linguistic determinism. Their behaviors are dictated almost entirely by disorganized language data, primarily English and Chinese. So you can't just add a language as a native primary language in a quick post-training pass, I think. There's no way that it would work.

Oh, it's all fine with cultural context here -- we don't even dub English language movies here because we are that cheap
Really? Because I'm pretty sure that at least every two days there's an active post with a top-voted comment along the lines of "The EU isn't doing AI themselves, they are so hosed".