by ssgodderidge 888 days ago
> Imagine if Linux published only a binary without the codebase. Or published the codebase without the compiler used to make the binary. This is where we are today.

This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.


In my mind, what's more crucial here is the code for downloading/scraping and labeling the data, not the model architecture or the training script.

As much as I appreciate Mis(x)tral, I would've loved it even more if they released code for gathering data.

I'm speculating they are attempting to avoid controversy about their data sources. That, and a possible competitive edge depending on what specific sets/filtering they're using.
To avoid controversy AND potential lawsuits.
Yup.

I think many countries (Japan already has) will allow IP-protected material to be used as training data.

They just need to buy time until then.

It’s common for third-party model testers to not disclose what they mean by the “Refusal” parameter as well, for obvious reasons. The world is full of witch-hunting maniacs now and will stay so for an indefinite amount of time. Just wait until the whole thing becomes more widely known and they realize. All AI companies have to hurry up before the doors shut.
IMHO much of the key training data can't simply be downloaded/scraped/labeled, no matter what code you had - it's not like it's freely accessible to everyone and just needs some code to get it and process it. You can't scrape all of Google Books archive or all of Twitter, and quite a few things that could be scraped at one point may actively prevent you from scraping them now.
I wouldn't mind having ready-to-use datasets instead of the code for downloading/scraping and labeling. It would save a lot of time. Writing code to gather the data is not complicated, but it may be impossible to replicate the datasets if some of the data you have to scrape is already gone (removed for various reasons).
I think a better analogy is firmware binary blobs in the Linux kernel, or VM bytecodes.

The LLM inference engine (architecture implementation) is like a kernel driver that loads a firmware binary blob, or a virtual machine that loads bytecode. The inference engine is open source. The problem is that the weights (firmware blobs, VM bytecodes) are opaque: you don't have the means to reproduce them.

The Linux community has long argued that drivers that load firmware blobs are cheating: they don't count as open source.

Still, the "open source" LLMs are more open than "API-gated" LLMs. It's a step in the right direction, but I hope we don't stop there.

If we're continuing the analogy, the compute required to turn the source into binaries costs millions of dollars. Not a license fee for the compiler, but the actual time on a computer.
The GPL describes the source as the "preferred form for modification".
And, that's obviously fun, because with LLMs, you have the LLM itself which cost hundreds of thousands in compute to train, but given you have the weights it's eminently fine-tunable. So it's actually not really like Linux - rather it's closer to something like a car, where you had no hope of making it in the first place but now you have it, maybe you can modify it.
So in this case, the weights are the source code and the training material + compute time is like the software development process that went into creating the source code.

It would probably take well over a million dollars in engineering hours to recreate the postgres source code from scratch, just as it would take millions in compute to rebuild the weights.

The model weights ARE the preferred form for modification
As a long-time 'practitioner' of machine learning models, I strongly disagree: the preferred form for model modification is retraining the model with a tweak to the parameters, the training algorithm, the model structure, the data selection, or the length of training.

You can get some effects by fine tuning, and in that case it may be preferable as it's cheaper, but in general if I want to have a different or better model, that involves retraining.

I don’t really believe your long-time practice is aligned with the kind of models being discussed.
Yeah, that's why data scientists are out there editing the weights rather than cleaning up datasets and rerunning training with different settings.
If that was supposed to be clever it just sounds naive. There’s a ton of work going on fine tuning open source models
> There’s a ton of work going on fine tuning

... models provided in weights only form. (mostly!)

I believe the preferred form would be the whole kit and caboodle: the collection and filtering scripts, the data to the extent that it's non-public, the training routine, and the model weights... because sometimes you'll perform changes at any of those stages.
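
As a rough illustration of that "whole kit" idea, here is a minimal sketch of a release checklist. The `ModelRelease` class and its field names are hypothetical, made up for this example; they do not come from any license text or real tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ModelRelease:
    """Hypothetical checklist of what a model release ships."""
    weights: bool = False
    collection_and_filtering_scripts: bool = False
    training_data: bool = False
    training_routine: bool = False

    def missing(self):
        """Return the names of the artifacts that were not published."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A typical "weights only" release, per the discussion above.
weights_only = ModelRelease(weights=True)
print(weights_only.missing())
```

Anything reported by `missing()` is a stage where a downstream user could not make changes.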

Do you actually do this for a living? Do you have experience doing this and have credibility talking about what’s preferred? I do.
Unless you want to try modifying the model structure, in which case the weights aren’t necessarily valid anymore and will need to be retrained.
The GNU GPLv3 [1] requires "Corresponding Source", not only the files that contain lines such as "def foo(bar):" or "foo(bar)". The Corresponding Source includes all of the files needed to turn your unmodified/modified copy of the source code into something the user can run, with exceptions for some of the tools that the author of the GPLed program has no authorship in.

> The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

...

> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License

Model weights alone are not Corresponding Source. In order to distribute a model you made under the GPLv3, you would have to give users the model weights and the scripts needed to turn the model weights into a runnable model. That's assuming that you only work with the model weights when modifying the model. If you in particular retrain the model as part of modifying the model, then you would have to provide the training data and initial training scripts as well.

Even though I wrote about a particular free software license which happens to be an open source license, the open source definition from the Open Source Initiative also refers to the preferred form of changing the work [2]:

> The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

For good measure, here is the relevant excerpt from the free software definition from the Free Software Foundation [3]:

> Obfuscated “source code” is not real source code and does not count as source code.

> Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer's version.

> Freedom 1 includes the freedom to use your changed version in place of the original. If the program is delivered in a product designed to run someone else's modified versions but refuse to run yours—a practice known as “tivoization” or “lockdown,” or (in its practitioners' perverse terminology) as “secure boot”—freedom 1 becomes an empty pretense rather than a practical reality. These binaries are not free software even if the source code they are compiled from is free.

The FSF's free software definition requires that the user be practically - not merely theoretically - allowed to modify the source code and turn the source code into a running program. Because of that, the free software definition considers build scripts to be part of the source code. I can't find an explicit analogue of the practically-modifiable requirement in the open source definition, but I think providing the model weights without providing the scripts needed to turn the weights into a functioning copy of the existing model would be obfuscation i.e. a violation of the open source definition.

[1] https://www.gnu.org/licenses/gpl-3.0.en.html

[2] https://opensource.org/osd/

[3] https://www.gnu.org/philosophy/free-sw.html

Off-topic but that's why I always fail to pick up android dev after so many false starts. It just never felt right.

Android is not open source.

No it’s not. You have everything you need to modify the models to your own liking. You can explore how it works.

This analogy is bad. Models are unlike code bases in this way.

> You have everything you need to modify the models to your own liking.

What if I wanted to train it using only half of its training set? If the inputs that were used to generate the set of released weights are not available I can’t do that. I have a set of weights and the model structure but without the training dataset I have no way of doing that.

To riff on the parent post, I have:

    Source + Compiler => Binaries
For the vast majority of open source models I have:

    [unavailable inputs] + Model Structure => Weights
They’re not exactly the same as the source code/binary scenario because I can still do this (which isn’t generally possible with binaries):

    Model Structure + Weights + [my own training data] => New Weights
Another way to look at it is that with source code I can modify the code and recompile it from scratch. Maybe I think the model author should have used a deeper CNN layer in the middle of the model. Without the inputs I can’t do a comparison.
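
A toy sketch of the three flows above, assuming a hypothetical one-parameter model y = w * x and made-up data. `train`, `original_data`, and `my_data` are illustrative names only, not something any real model release ships. It just aims to show that fine-tuning needs the released weights plus your own data, while any retraining variant needs the original inputs.

```python
# Toy illustration only: a one-parameter "model" y = w * x.
# All names and data below are hypothetical.

def train(w, data, lr=0.01, epochs=200):
    """Plain gradient descent on mean squared error, starting from weight w."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# [unavailable inputs] + Model Structure => Weights
original_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # kept private by the publisher
released_w = train(0.0, original_data)                # the only thing that gets released

# Model Structure + Weights + [my own training data] => New Weights
my_data = [(1.0, 2.5), (2.0, 5.0)]
new_w = train(released_w, my_data)

# Retraining from scratch on half of the original set needs original_data,
# which a downstream user of a weights-only release does not have.
half_w = train(0.0, original_data[: len(original_data) // 2])
```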
> Maybe I think the model author should have used a deeper CNN layer in the middle of the model. Without the inputs I can’t do a comparison.

You can fine tune into a different model architecture.

You’re right on not being able to retrain the model from scratch on half its data without that data but that’s likely pointless.

I’d be happy to be wrong about this but my understanding is that changing the architecture of the last few layers is feasible with fine-tuning but changing middle layers isn’t likely going to work very well without having the full original input set.
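
To make the "last few layers" case concrete, here is a minimal sketch assuming a hypothetical two-stage model: a released, frozen feature layer z = a * x followed by a brand-new head y = c * z + d that is trained only on your own data. The function name `finetune_head` and all numbers are made up for illustration; real fine-tuning would use a framework and far larger models.

```python
# Toy illustration only: freeze the released early layer, train a new head.
# All names and data below are hypothetical.

def finetune_head(frozen_a, data, lr=0.05, epochs=2000):
    """Fit a new head (c, d) by gradient descent; frozen_a is never updated."""
    c, d = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_c = 0.0
        grad_d = 0.0
        for x, y in data:
            z = frozen_a * x           # output of the frozen early layer
            err = (c * z + d) - y      # prediction error of the new head
            grad_c += 2 * err * z / n
            grad_d += 2 * err / n
        c -= lr * grad_c
        d -= lr * grad_d
    return c, d


released_a = 2.0                                   # weight taken from the released model
my_data = [(0.0, 1.0), (0.5, 2.5), (1.0, 4.0)]     # my own data, here y = 3x + 1
c, d = finetune_head(released_a, my_data)
```

The head converges without touching `released_a` or the original training set; the sketch says nothing about whether swapping middle layers would work equally well.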

> likely pointless

It doesn’t take too much creativity to come up with ideas about why someone might want to do that:

- researchers who want to investigate how much the dataset can be reduced (and thus training cost) and what the accuracy penalty is

- someone who wants, for either religious or ethical reasons, to minimize the probability that the model was trained on pornography

- someone who’s curious about whether there’s significant redundancy in the existing input datasets

- someone who’s curious about whether there is a much smaller subset of images in the input dataset that can quickly help the first few CNN input layers converge before training the middle and output layers on the larger dataset.

Edit: I suspect the real reason they don’t want to share the input dataset is purely because a high-quality annotated dataset is a valuable commodity. While I don’t do ML work myself day-to-day, I do work with a team that does in a very niche field and I can only imagine how much effort they had to go through to get the annotated dataset that they’ve put together. Even just collecting the images for it involved many hours of drone flights in different locales around North America in varying weather and lighting.

Original input set is irrelevant.

You will need some data of your own of course to fill in the blanks

Edit: conversely, however, you can also splice layers from one model into another original model. It’ll take some retraining, but this works!

You can do the same with binaries. Can modify those all you want.

Models are the compiler + makefiles. Dataset is the code.

I don't know about the OSI's open source definition [1] in general, but specific licenses might consider makefiles and build scripts to be part of the source code. (For what it's worth, the free software definition from the FSF does consider makefiles and build scripts to be part of the source code [2].)

[1] https://opensource.org/osd/

[2] https://www.gnu.org/philosophy/free-sw.html

No, it’s not the same. Yes, you can technically modify binaries, but it’s not at all the preferred way to modify the program.
Congratulations. You've almost finished understanding my comment.
Well you’ve failed and managed to be a dick
> Or published the codebase without the compiler used to make the binary.

A slightly off-topic complaint, but too often I have seen tutorials for open source stuff (cough OpenGL cough) where they don't provide the proper commands to compile and link everything required to build it. Figuring it out makes the "getting started" portion even more tedious.

Open Source and Free Software weren't formulated to deal with the need for such gargantuan amounts of data and compute.

Can the public compete? What percentage of the technical public could we expect to participate, and how much data, compute, and data quality improvement could they bring to the table? I suspect that large corporations are at least an order of magnitude advantaged economically.

There is a big effort under way in China: Yuanqing Lin gave an interview in the deep learning course about work at this magnitude [1]. They suggest that they will host the resources to store the data and train on it, and have all those algorithms available in China.

[1] https://www.youtube.com/watch?v=3GfOnI3goAk

The public doesn't have the resources to train the largest state-of-the-art LLMs, but training useful LLMs seems doable. Maybe not for most individuals but certainly for a range of nonprofits, research teams and companies.
Isn't it relatively easy for a smaller model to poke holes in the output of a larger model?
But not nearly as in reach as modifying open source models.
Open Source and Free Software are not about the amount of data.