by andy99 888 days ago
I don't agree, and the analogy is poor. One can do the things he lists with a trained model. Having the data is basically a red herring. I wish this got more attention. Open/free software is about exercising freedoms, and they all can be exercised if you've got the model weights and code.

https://www.marble.onl/posts/considerations_for_copyrighting...


But one of the four freedoms is being able to modify/tweak things, including the model. If all you have is the model weights, then you can't easily tweak the model. The model weights are hardly the preferred form for making changes to update the model.

The equivalent would be someone who gives you only the LibreOffice binary. That's perfectly fine for editing documents and spreadsheets, but suppose you want to fix a bug in LibreOffice? Just having the binary is going to make it quite difficult to fix things.

Similarly, suppose you find that the model has a bias in terms of labeling African Americans as criminals; or women as lousy computer programmers. If all you have is the model weights of the trained model, how easily can you fix the model? And how does that compare with running emacs on the LibreOffice binary?

If all you have are the model weights, you can very easily tweak the model. How else are all these "decensored" Llama2 models showing up on Hugging Face? There's a lot of value in a trained LLM itself and it's 100% a type of openness to release these trained models.

What you can't easily do is retrain from scratch using a heavily modified architecture or different training data preconditioning. So yes, it is valuable to have dataset access and compute to do this and this is the primary type of value for LLM providers. It would be great if this were more open — it would also be great if everybody had a million dollars.

I think it's pretty misguided to put down the first type of value and openness when honestly they're pretty independent, and the second type of value and openness is hard for anybody without millions of dollars to access.

Well, by that argument it's trivially easy to run emacs on a binary and change a pathname --- or wrap a program with another program to "fix a bug". Easy, no?

And yet, the people who insist on having source code so they can edit the program and recompile it have said that for programs, having just the binary isn't good enough.

>suppose you find that the model has a bias in terms of labeling African Americans as criminals; or women as lousy computer programmers. If all you have is the model weights of the trained model, how easily can you fix the model?

That's textbook fine-tuning and is basically trivial. Adding another layer and training that is many orders of magnitude more efficient than retraining the whole model and works ~exactly as well.
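
For anyone who wants to see the "add a layer and train only that" idea in miniature, here is a toy sketch in plain Python. It is not an LLM and not any real library's API: `base_model`, `BASE_W`, `BASE_B`, `train_head`, and the correction set are all made up for illustration. The base weights stay frozen and only the two parameters of the added output layer get trained.

```python
import random

# Hypothetical stand-in for a released, pretrained model. Its weights are
# fixed, and the data that produced them is not available to us.
BASE_W, BASE_B = 2.0, -1.0

def base_model(x):
    """Frozen base: nothing below ever updates BASE_W or BASE_B."""
    return BASE_W * x + BASE_B

def train_head(samples, lr=0.05, epochs=200):
    """Fit an added output layer y = a * base_model(x) + c with SGD.

    Only a and c (the new layer) are trained.
    """
    a, c = 1.0, 0.0  # start as the identity, so behaviour begins unchanged
    for _ in range(epochs):
        for x, y in samples:
            h = base_model(x)        # forward pass through the frozen base
            err = (a * h + c) - y    # prediction error of the added layer
            a -= lr * err * h        # gradient step on the new layer only
            c -= lr * err
    return a, c

# Small, made-up correction set: we want the base output shifted by +0.5.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(20)]
correction_set = [(x, base_model(x) + 0.5) for x in xs]

a, c = train_head(correction_set)
print(f"added layer: a={a:.4f}, c={c:.4f}")
```

The only thing this shows is the mechanics: the update touches the added parameters and never needs the data the base was trained on. How well that carries over to a real LLM is not something a toy like this can settle.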

Models are data, not instructions. Analogies to software are actively harmful. We do not fix bugs in models any more than we fix bugs in a JPEG.

Instructions are exactly what weights are. We just have no idea what those instructions are.

You can fine-tune a model, and you've got way more power to do so given the trained model than starting from scratch with the raw data.

The next step will be to ask for GPU time, because even with the data, the model code, and the training framework you may have no resources to train. "The equivalent would be" someone giving you the code but no access to the mainframe required to compile it. Would that make it not open source? There are other variations: the original compiler was lost, or current compilers aren't backward compatible. Does that make old open source code closed now?

In other words, there should be a reasonable line for when a model can be called open source. In the extreme view, it's when the model, the training framework, and the data are all available for free. This would mean an open source model can be trained only on public domain data, which makes the class of open source models very, very limited.

More realistic is to make the code and the weights available, so that with some common knowledge a new model can be trained, or the old one fine-tuned, on available data. Important note: the weights cannot be reproduced exactly even if the original training data is available. It will always be a new model with (slightly) different responses.

Downvoted, hmm... I'll add a bit more then. Sometimes it's even good that a model cannot be easily reproduced. The original developers usually have some skills and responsibility, while 'hackers' don't. It's easy to introduce bias into the data, like removing selected criminal records, and then publish a model with a similar name. That would be confusing; some may mistake the fake one for the real one.

PS: If I ever make my models open, I can't open the data anyway. The license on the images directly prohibits publishing them.

My main concern is that if all you have are weights you're stuck hoping for the benevolence of whatever organization is actually able to train the model with their secret dataset.

When they get bought by Oracle and progress slows to a crawl because it's not profitable enough to interest them, you can't exactly do a LibreOffice. Or they can turn around and say "license change, future versions may not be used for <market that controlling company would like to dominate>" and now you're stuck with whatever old version of the model while they steamroll your project with newer updates.

Open weights are worth nothing in terms of long term security of development. They're a toy that you can play with, but you have no assurances of anything for the future.

Everything you just said applies to normal software. Oh no! Big Corp just started a closed fork of their open source codebase! Well, the open source version is still there. The open source community can build off of it.

You may complain that subsequent models are not iterative on the past and so having that old version doesn’t help; but then the data probably changes too so having the old data would largely leave you with the same old model.

When you train an updated model on a new dataset do you really start by deleting all of the data that you collected for training the previous version?

Probably not. But if it’s the new data providing the advantage then you’re not exactly better off having the old data and the model vs. just having the model.

The idea would be that another group could fork it and continue adding to the dataset on their own.

As opposed to not being able to fork it at all because an "open source" model actually just means "you are allowed to use this particular release of our mystery box."

You do not need the original dataset to train the model on an additional dataset.

Maybe I misunderstood your original question. To be clear, the process of modifying a trained model does not require the presence of the original data. You said “deleted” which perhaps I misinterpreted. You’re not “instantiating a new model from scratch” when you modify it. You’re continuing to train it where it left off.
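
A minimal sketch of "continuing where it left off", assuming a toy linear model trained with plain SGD rather than anything LLM-sized. The names `sgd`, `loss`, `original_data`, and `new_data` are invented for this example. The original data is deleted before the second stage, so the continued run only ever sees the released weights and the new data.

```python
def sgd(weights, samples, lr=0.1, epochs=100):
    """Plain SGD on squared error for y = w*x + b, starting from `weights`."""
    w, b = weights
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def loss(weights, samples):
    """Mean squared error of the model on `samples`."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)

# Stage 1: the publisher trains on data that downstream users never see.
original_data = [(i / 10, 3.0 * (i / 10) + 1.0) for i in range(-10, 11)]
released = sgd((0.0, 0.0), original_data)
del original_data  # from here on, only the released weights exist

# Stage 2: a downstream user has new data with a slightly different target.
new_data = [(i / 10, 3.0 * (i / 10) + 1.5) for i in range(-10, 11)]

# Continue from the released weights vs. start over, same small budget.
continued = sgd(released, new_data, epochs=5)
from_scratch = sgd((0.0, 0.0), new_data, epochs=5)

print("released     :", loss(released, new_data))
print("continued    :", loss(continued, new_data))
print("from scratch :", loss(from_scratch, new_data))
```

In this toy setup the continued run ends up with a lower loss on the new data than the from-scratch run given the same five epochs, and it never touched the original dataset. That says nothing about whether a fork could keep up with the original publisher, which is the other half of the disagreement above.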