Hacker News new | ask | show | jobs
by isusmelj 419 days ago
As someone in Europe, I sometimes wonder what’s worse: letting US companies use my data to target ads, or handing it to Chinese companies where I have no clue what’s being done with it. With one I at least get an open source model. The other is a big black box.
4 comments

Both are bad. If Europe does not develop local alternatives to ChatGpt or DeepSeek, it will (slowly) lose its sovereingty.
Europe is developing local alternative models such as Mistral
They're not open source. It's nice of Meta and Deepseek to offer up their models for download, but that doesn't make them open source.
Hard to be fully open source if you train on copyrighted material.

Anyway. Deepseek is the most open of the sota models.

Did they open their datasets already? Would be nice to have 'thinking' part.
Isn't this a bit of semantic lawyering? Open model weights are not the same as open source in a literal sense, but I'd go so far as to suggest that open model weights fulfill much of the intent / "soul" of the open source movement. Would you disagree with that notion?
> open model weights fulfill much of the intent / "soul" of the open source movement

Absolutely not. The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

A binary (and that's arguably what weights are) you can semi-freely download and distribute is just shareware – that's several steps away from actual open source.

There's nothing wrong with shareware, but calling it open source, or even just "source available" (i.e. open source with licensing/usage restrictions), when it isn't, is disingenuous.

> The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

That's not enough. The key point was trust. When executable can be verified by independent review and rebuild. It it cannot be rebuilt it can be virus, troyan, backdoor, etc. For LLMs there is no way to reproduce, thus no way to verify them. So, they cannot be trusted and we have to trust producers. It's not that important when models are just talking, but with tools use it can be a real damage.

Hm, I wouldn't say that that's the key point of open software. There are many open source projects that don't have reproducible builds (some don't even offer any binary builds), and conversely there is "source available" software with deterministic builds that's not freely licensed.

On top of that, I don't think it works quite that way for ML models. Even their creators, with access to all training data and training steps, are having a very hard time reasoning about what these things will do exactly for a given input without trying it out.

"Reproducible training runs" could at least show that there's not been any active adversarial RHLF, but seem prohibitively expensive in terms of resources.

Well, 'open source' is interpreted in different ways. I think the core idea is it can be trusted. You can get Linux distribution and recompile every component except for the proprietary drivers. With that being done by independent groups you can trust it enough to run bank's systems. The other options are like Windows where you have to trust Microsoft and their supply chain.

There are different variations, of course. Mostly related to the rights and permissions.

As for big models even their owners, having all the hardware and training data and code, cannot reproduce them. Model may have some undocumented functionality pretrained or added in post process, and it's almost impossible to detect without knowing the key phrase. It can be a harmless watermark or something else.

But there is also no publicly known way to implant unwanted telemetry, backdoors, or malware into modern model formats either (which hasn't always been true of older LLM model formats), which mitigates at least one functional concern about trust in this case, no?

It's not quite like executing a binary in userland - you're not really granting code execution to anyone with the model, right? Perhaps there is some undisclosed vulnerability in one or more of the runtimes, like llama.cpp, but that's a separate discussion.

The biggest problem is arguably at a different layer: These models are often used to write code, and if they write code containing vulnerabilities, they don't need any special permissions to do a lot of damage.

It's "reflections on trusting trust" all the way down.

I saw the argument that the source code is the preferred base to make changes and modifications in software, but in the case of those large models, the weights themselves are the preferred way.

It's much easier and cheap to make a finetune or LoRA than to train from scratch to adapt it to your use case. So it's not quite like source vs binary in software.

Meta models do not, they have use restrictions. At least deepseek does not.
It does not and I totally disagree with that. Unless we can see the code that goes into the model to stop of from telling me how to make cocaine, it's not the same sort of soul.
> With one I at least get an open source model. The other is a big black box.

It doesn't matter much as in both cases provider has access to you ins and outs. The only question is if you trust company operating the model. (yes, you can run local model, but it's not that capable)

US is capitalistic liberal democracy and China is one party capitalistic dictatorship. Make your choice.
The US tends towards dictatorship; due process is an afterthought, people disappearing off the streets, citizens getting arrested at the border for nothing, tourists getting deported over minute issues such as an iffy hotel booking, and that's just off the top of my head from the last 2 days.
You make it seem so binary. If you do enough research on the US you might change your mind. YES, I would still choose the US.