by piperswe 390 days ago
But where's the source? I just see a binary blob, what makes it open source?
4 comments

The weights are the source. It isn't as though something was compiled into weights. They're trained directly. But I know what you mean, it would be more open to have the training pipeline and source dataset available.
The weights seem much more like a binary to me, the training pipeline the compiler, and the training dataset the source.
Came here to write this - perfect analogy!
It's a very imperfect analogy, though: these things can't be rebuilt "from scratch" like a program, and the training process doesn't seem to be replicable anyway. Nonetheless, full data disclosure is necessary, according to the result of the years-long consultation led by the Open Source Initiative: https://opensource.org/ai
> the training process doesn't seem to be replicable anyway

The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.

> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

No, it is not. The training process is non-deterministic: given exactly the same data, the same code and the same seeds, you'll get different weights. Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using (e.g. you'll get different results on CPU, on a GPU from vendor #1 and on a GPU from vendor #2, and probably on different GPUs from the same vendor, on different CUDA versions, etc.). The results also depend on the dimensions of the matrices (e.g. if you fuse the QKV weights from modern transformers into a single matrix and do a single multiplication instead of multiplying each separately, you'll get different results), and some algorithms (e.g. the backward pass of Flash Attention) are explicitly non-deterministic to be faster.
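
To make the floating-point part concrete, here is a minimal sketch in plain Python (my own illustration, not PyTorch and not anything from a real training stack). Floating-point addition is not associative, so any kernel that sums the same terms in a different order, as a parallel reduction inside a matmul may do, can return a slightly different result:

```python
# Same three numbers, two different summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

The difference is one unit in the last place here, but over billions of accumulated updates such discrepancies can compound.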

A lot of quibbling here; I wasn't sure where to reply. If you've built any models in PyTorch, then you know. Conceptually it is deterministic: a model trained using deterministic implementations of low-level algorithms will produce deterministic results. And when you are optimizing the pipeline, it is common to do just that:

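    import random
    import numpy as np
    import torch

    torch.manual_seed(0)                      # seed PyTorch's RNGs
    random.seed(0)                            # seed Python's built-in RNG
    np.random.seed(0)                         # seed NumPy's global RNG
    torch.use_deterministic_algorithms(True)  # error out on ops with no deterministic implementation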
But in practice that is too slow, so we use nondeterministic implementations that play fast and loose with memory management and don't necessarily care about the order in which parallel operations return.
I’m pretty sure the initial weights are randomized, meaning no two models will train in the same way twice. The order in which you feed in training data to the model would also add an element of randomness. Model training is closer to growing a plant than running a compiler.
What makes models non-deterministic isn't the training algorithm, but the initial weights being random.

Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.
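
A toy sketch of that claim, assuming a single-threaded pure-Python SGD loop (the `train` function, the data and the hyperparameters are all made up for illustration, and nothing here says anything about GPU kernels). The seed controls both the initial weights and the sample order, so fixing it reproduces the run exactly and changing it does not:

```python
import random

def train(seed, steps=50, lr=0.05):
    """Fit y = w * x + b with plain SGD; `seed` fixes the init and the sample order."""
    rng = random.Random(seed)
    data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    for _ in range(steps):
        x, y = rng.choice(data)                    # random data order
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

same_a = train(seed=0)
same_b = train(seed=0)
other = train(seed=1)

print(same_a == same_b)  # True: same seed, bit-identical weights
print(same_a == other)   # False: different init and data order
```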

Can you point at the research that says that the training process of an LLM at least the size of OLMo or Pythia is deterministic?
You can fine-tune their weights and release your own take.

E.g. see all the specialized third-party models out there based on Qwen.

"Open-source" is the wrong word here, what they mean is "you can modify and redistribute these weights".

You can also reverse engineer and modify closed source programs (see mods for games). Weights are like a compiled version of the source data.
Finetuning isn't reverse engineering. Finetuning is a standard supported workflow for these models.

Also, the "redistribute" part is key here.

> Finetuning isn't reverse engineering

Fully agree, it isn't. Reverse engineering isn't necessary for modifying a compiled program's behaviour, so comparing it to finetuning is not applicable. Finetuning applied to the program domain would be more like adding plugins or patching in some compiled routines. Reverse engineering applied to models would be like extracting source documents from weights.

> Finetuning is a standard supported workflow for these models.

Yes, so is adding mods for some games: just put your files in a designated folder and the game automatically picks them up and makes the required modifications.

> Also, the "redistribute" part is key here.

It is not. Redistributability and being open source are orthogonal. You can have the source for a program and not be able to redistribute the source or the program, or you can redistribute a compiled program but not have its source (freeware).

Not legally. That's the difference.
Sure you can. It's often a legally protected activity. You're just limited to distributing your modifications without the original work.
For some games maybe, but software often has a clause forbidding reverse engineering.
ChatGPT says that such clauses are typically void in the EU, though they may apply in some cases in the US. Even in the US, the triennial DMCA rule-making has granted broader exemptions for good-faith security research every cycle since 2016.

https://chatgpt.com/share/6838c070-705c-8005-9a88-83c9a5550a...

There is work to try to reproduce (the original) R1: https://huggingface.co/open-r1
I wouldn't call it a "binary blob". Safetensors is just a simple format for storing tensors safely: https://huggingface.co/docs/safetensors/index
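
For the curious, a rough sketch of why it's inspectable, assuming the layout as I understand it from those docs: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape and byte offsets, then the raw tensor bytes. `build_file` and `read_header` are hypothetical helpers written for this comment, not part of the safetensors library (use the real library for actual files):

```python
import json
import struct

def build_file(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} in the assumed safetensors layout."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the byte buffer, not the file.
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [len(payload), len(payload) + len(raw)]}
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_header(blob):
    """Return the parsed JSON header and the offset where tensor bytes begin."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len

raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # a 2x2 float32 tensor
blob = build_file({"layer.weight": ("F32", [2, 2], raw)})

header, data_start = read_header(blob)
begin, end = header["layer.weight"]["data_offsets"]
values = struct.unpack("<4f", blob[data_start + begin:data_start + end])
print(header)
print(values)  # (1.0, 2.0, 3.0, 4.0)
```

So the tensor names, shapes and dtypes are plain JSON; only the numbers themselves are opaque.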