Hacker News new | ask | show | jobs
by nullc 888 days ago
The GPL describes the source as the "preferred form for modification".
3 comments

And, that's obviously fun, because with LLMs, you have the LLM itself which cost hundreds of thousands in compute to train, but given you have the weights it's eminently fine-tunable. So it's actually not really like Linux - rather it's closer to something like a car, where you had no hope of making it in the first place but now you have it, maybe you can modify it.
So in this case, the weights are the source code and the training material + compute time is like the software development process that went into creating the source code.

It would probably take well over a million dollars in engineering hours to recreate the postgres source code from scratch, just as it would take millions in compute to rebuild the weights.

The model weights ARE the preferred form for modification
As long-time a 'practitioner' of machine learning models I strongly disagree, the preferred form for model modification is by retraining the model with a tweak to the parameters or the training algorithm or the model structure or data selection or length of training.

You can get some effects by fine tuning, and in that case it may be preferable as it's cheaper, but in general if I want to have a different or better model, that involves retraining.

I don’t really believe your long time practitioning is aligned to the kind of models being discussed
Yeah, that's why data scientists are out there editing the weights rather than cleaning up datasets and rerunning training with different settings.
If that was supposed to be clever it just sounds naive. There’s a ton of work going on fine tuning open source models
> There’s a ton of work going on fine tuning

... models provided in weights only form. (mostly!)

I believe the preferred form would be the whole kit and caboodle: the collection and filtering scripts, the data to the extent that it's non-public, the training routine, and the model weights... because sometimes you'll perform changes at any of those stages.

Do you actually do this for a living? Do you have experience doing this and have credibility talking about what’s preferred? I do.
OK. Where is your reproduction of Pythia trained from scratch? Or MPT? Or Amber? Shall we play a game where you give paper regarding pretraining (and we are not taling about puny models based on wikitext2) I give you a paper based around finetuning and we'll see who run out of papers first?
Unless you want to try modifying the model structure, in which case the weights aren’t necessarily valid anymore and will need to be retrained.
The GNU GPLv3 requires "Corresponding Source", not only the files that contain lines such as "def foo(bar):" or "foo(bar)". The Corresponding Source includes all of the files needed to turn your unmodified/modified copy of the source code into something the user can run, with exceptions to some of the tools that the author of the GPLed program has no authorship in.

> The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

...

> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License

Model weights alone are not Corresponding Source. In order to distribute a model you made under the GPLv3, you would have to give users the model weights and the scripts needed to turn the model weights into a runnable model. That's assuming that you only work with the model weights when modifying the model. If you in particular retrain the model as part of modifying the model, then you would have to provide the training data and initial training scripts as well.

Even though I wrote about a particular free software license which happens to be an open source license, the open source definition from the Open Source Initiative also refers to the preferred form of changing the work [2]:

> The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

For good measure, here is the relevant excerpt from the free software definition from the Free Software Foundation [3]:

> Obfuscated “source code” is not real source code and does not count as source code.

> Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer's version.

> Freedom 1 includes the freedom to use your changed version in place of the original. If the program is delivered in a product designed to run someone else's modified versions but refuse to run yours—a practice known as “tivoization” or “lockdown,” or (in its practitioners' perverse terminology) as “secure boot”—freedom 1 becomes an empty pretense rather than a practical reality. These binaries are not free software even if the source code they are compiled from is free.

The FSF's free software definition requires that the user be practically - not merely theoretically - allowed to modify the source code and turn the source code into a running program. Because of that, the free software definition considers build scripts to be part of the source code. I can't find an explicit analogue of the practically-modifiable requirement in the open source definition, but I think providing the model weights without providing the scripts needed to turn the weights into a functioning copy of the existing model would be obfuscation i.e. a violation of the open source definition.

[1] https://www.gnu.org/licenses/gpl-3.0.en.html

[2] https://opensource.org/osd/

[3] https://www.gnu.org/philosophy/free-sw.html