by wlesieutre 888 days ago
My main concern is that if all you have are the weights, you're stuck hoping for the benevolence of whatever organization is actually able to train the model with its secret dataset.

When they get bought by Oracle and progress slows to a crawl because it's not profitable enough to interest them, you can't exactly do a LibreOffice. Or they can turn around and say "license change, future versions may not be used for <market that controlling company would like to dominate>" and now you're stuck with whatever old version of the model while they steamroll your project with newer updates.

Open weights are worth nothing in terms of long-term security of development. They're a toy you can play with, but you have no assurances about the future.


Everything you just said applies to normal software. Oh no! Big Corp just started a closed fork of their open source codebase! Well, the open source version is still there. The open source community can build off of it.

You may object that subsequent models aren’t built iteratively on past ones, so having the old version doesn’t help. But then the data probably changes too, so having the old data would largely just let you rebuild the same old model.

When you train an updated model on a new dataset, do you really start by deleting all of the data that you collected for training the previous version?
Probably not. But if it’s the new data providing the advantage then you’re not exactly better off having the old data and the model vs. just having the model.
The idea would be that another group could fork it and continue adding to the dataset on their own.

As opposed to not being able to fork it at all because an "open source" model actually just means "you are allowed to use this particular release of our mystery box."

You do not need the original dataset to continue training the model on an additional dataset.

Maybe I misunderstood your original question. To be clear, the process of modifying a trained model does not require the presence of the original data. You said “deleted” which perhaps I misinterpreted. You’re not “instantiating a new model from scratch” when you modify it. You’re continuing to train it where it left off.
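To make "continuing where it left off" concrete, here is a minimal sketch using a toy one-variable linear model rather than a real neural network. The `train` and `mse` helpers, the data, and the JSON "release" format are all made up for illustration; the only point is that the second training run starts from the released weights and never touches the original data.

```python
import json

def train(weights, data, lr=0.01, epochs=200):
    """Plain gradient descent on a toy model y = w*x + b (illustrative only)."""
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += 2 * err * x
            grad_b += 2 * err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return (w, b)

def mse(weights, data):
    """Mean squared error of the toy model on a dataset."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Publisher side: trains on a private dataset and releases only the weights.
original_data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]
released = json.dumps(train((0.0, 0.0), original_data))  # the "open weights"
del original_data  # downstream users never see this

# Downstream side: load the released weights and keep training on new data.
start = tuple(json.loads(released))
new_data = [(x, 2.5 * x + 1.0) for x in range(-5, 6)]  # hypothetical new dataset
loss_before = mse(start, new_data)
finetuned = train(start, new_data, epochs=50)
loss_after = mse(finetuned, new_data)

print(f"loss on new data before: {loss_before:.4f}, after: {loss_after:.6f}")
```

Real models differ in many ways (continued training can, for instance, degrade what was learned earlier), so treat this as an illustration of the mechanics, not of the quality you would get.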

What if you want to start with a subset of the original data? Say you've trained a model and later decide, "You know, this new data we're adding is great, but maybe pulling all those comments from 4chan earlier was a mistake." Wouldn't that require starting fresh with access to the actual data?