by nine_k 494 days ago
To me, the ML situation looks roughly like this.

(1) Model weights are something like a bytecode blob. You can run it in a conformant interpreter, and be able to do inference.

(2) Things like llama.cpp are the "bytecode interpreter" part, something that can load the weights and run inference.

(3) The training setup is like a custom "compiler" which turns training data to the "bytecode" of the model weights.

(4) The actual training data is like the "source code" for the model, the input of the training "compiler".
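
A toy sketch of the four pieces, assuming a one-variable linear model stands in for a real network. The names `train`, `infer`, and `training_data` are made up for illustration and don't come from any real framework:

```python
# Toy illustration of the analogy; every name here is hypothetical.

def train(data, steps=2000, lr=0.05):
    """(3) The "compiler": turns training data (4) into weights (1)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y      # prediction error on one example
            grad_w += 2 * err * x
            grad_b += 2 * err
        # Plain gradient descent on the mean squared error.
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return {"w": w, "b": b}

def infer(weights, x):
    """(2) The "interpreter": loads the weights and runs them on an input."""
    return weights["w"] * x + weights["b"]

# (4) The "source code": here, samples of y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# (1) The "bytecode blob": two opaque numbers.
weights = train(training_data)

print(infer(weights, 10.0))  # close to 21
```

With only `weights` and `infer` you can run the model; to rebuild `weights` the way the author did, you also need `train` and `training_data`.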

Currently (2) is well-served by a number of open-source offerings. (1) is what is usually released when a new model is released. (1) + (2) give the ability to run inference independently.

AFAICT, Red Hat suggests that an "open-source ML model" must include (1), (2), and (3), so that the way the model has been trained is also open and reusable. I would say that it's great for scientific / applied progress, but I don't think it's "open source" proper. You get a binary blob and a compiler that can produce it and patch it, but you can't reproduce it the way the authors did.

Releasing the training set, the (4), to my mind, would be crucial for the model to be actually "open source" in the way an open-source C program is.

I understand that the training set is massive, may contain a lot of data that can't be easily released publicly but that were licensed for the training purposes, and that training from scratch may cost millions, so releasing the (4) is very often infeasible.

I still think that (1) + (2) + (3) should not be called "open-source", because the source is not open. We need a different term, like "open structure" or something. It's definitely more open than something that's only available via an API, or as just weights, but not completely open.

3 comments

It is really just “open use”, with the details defined by the license type (MIT, etc.)
It's more than just use (inference); it does open up some of the otherwise secret sauce of the training. It looks like there's no existing word / notion to exactly pinpoint this level of openness.
> Model weights are something like a bytecode blob

Can you update a bytecode blob as easily as fine-tuning and prompting models? It only takes a few input-output pairs and a few dollars' worth of compute. They are more like an operating system, and fine-tuning/prompting is like scripting on top. As with Linux, you can download an LLM and run it locally.
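
To make the "few input-output pairs" point concrete, here is a toy sketch, assuming a two-parameter linear model in place of an LLM. `released`, `fine_tune`, and `new_pairs` are hypothetical names; real fine-tuning involves far more machinery. The point is only that existing weights can be nudged without access to the original training data:

```python
# Hypothetical toy: "fine-tuning" as a few gradient steps on released weights.

def predict(weights, x):
    """Run the (toy) model on one input."""
    return weights["w"] * x + weights["b"]

def fine_tune(weights, pairs, steps=2000, lr=0.05):
    """Nudge existing weights toward a handful of new input-output pairs."""
    w, b = weights["w"], weights["b"]
    for _ in range(steps):
        for x, y in pairs:
            err = (w * x + b) - y   # error on this one example
            w -= lr * 2 * err * x   # per-example gradient step
            b -= lr * 2 * err
    return {"w": w, "b": b}

released = {"w": 2.0, "b": 1.0}        # stands in for downloaded weights (y = 2x + 1)
new_pairs = [(1.0, 4.0), (2.0, 6.0)]   # desired behaviour: y = 2x + 2

tuned = fine_tune(released, new_pairs)

# The released weights give 7.0 at x = 3; the tuned ones land close to 8.
print(predict(released, 3.0), predict(tuned, 3.0))
```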

I think these endless debates about whether open-weights models qualify for a particular piece of terminology are... tiring. That said, I think the debates would benefit from discussing model training and model inference as two separate systems, because that's what they are. It's possible for model training to be closed-source while model inference is open-source, and vice versa.

Consider the recent Mistral-Small release. The model training is almost totally closed-source. You can't replicate it. However, the model inference is fully open source: the code and weights are Apache licensed. Not only that, but Mistral released both the base model and the instruction-tuned model, so you have a good foundation to work from (the base model) should you prefer to do your own instruction tuning. In fact, Mistral has also open-sourced code to aid in the fine-tuning process. So you really have everything you need* to use and customize this inference system. And for most practical purposes, even if you had the original training data, it would be of no use to you.

It's also worth considering the inverse scenario. Suppose Meta were to release a big blob of pre-training data and scripts for Llama 405B, but no weights. This clearly qualifies as open source, but it is basically useless unless you have many millions of dollars to do something with it. It would do very little to democratize access to AI.

* Asterisk: There is one situation where having access to the original training data would be really, really useful -- model distillation. Nobody can match Meta's ability to distill Llama 405B into an 8B size, because that process works best when you can do it on identically distributed data.

For me, the possibility of attacking ML models by poisoning their training data precludes considering models without freely distributable and modifiable training data as open-source or libre models.