

by _flux 170 days ago

So we are in agreement that "weights" are not source code. Training data might also not be actual "code", but it is source. After all, the model trained using that data tries to estimate its training data. It is the ground truth for the model.

As for access to binaries or providing working implementations, where did those come from? I don't think this thread was discussing those at all.

Indeed, I would be willing to call something an "open source model" if it came without weights but did come with the training data and a documented process (preferably executable). A release with just the training data could be called an "open dataset", while the software to run the training would be just plain old open source software.

And, of course, a model with only the model data distributed under an open license is commonly called "open weights", which is a pretty self-explanatory term.


ecb_penguin 169 days ago

It is absurd to think that releasing open source code also requires releasing thousands of terabytes of Twitter and Reddit posts.

You already have access to all the training data everyone else is using.... You can download an offline version of Wikipedia. Here's every Reddit comment for a decade: https://academictorrents.com/details/ba051999301b109eab37d16...

_flux 169 days ago

I mean no, you don't need to be open source at all. Just don't release the data and call the release "open weights". Or do release the data, and the training process, and call yourself "open source".

Though I do think it's still acceptable if you just point to how to get the data (e.g. if it was the offline version of Wikipedia, then the URL to that) if actually providing the source data is overwhelming. Offering to provide a copy at cost would be quite acceptable (i.e. I deliver the media to you to make a copy).

But if there's no way another person can acquire that data, even in theory, then I think it's pretty clear the source was not open. Just use the more appropriate term and everyone is on the same page about what the release is.