by blackeyeblitzar 496 days ago
Disappointing that Red Hat is basically validating open weights as open source, and excusing it by saying this:

> The majority of improvements and enhancements to AI models now taking place in the community do not involve access to or manipulation of the original training data. Rather, they are the result of modifications to model weights or a process of fine tuning which can also serve to adjust model performance.

Well yes, because they have no access to anything more. With training source code and data they might do something different. If you don’t have all the things used to produce the final result, it’s not open source.

3 comments

Do you believe that open source can exist on top of closed hardware? I ask because you can't produce the final result without having someone give you the firmware blob. To me, this seems like an analogue to building on top of open weight models.
The math underpinning an AI model exists independent of the hardware it's realized on. I can train a model on one GPU and someone else can replicate my results with a different GPU running different drivers, up to small numerical differences that should hopefully not have major effects.
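To make the "small numerical differences" point concrete: floating-point addition is not associative, so two setups that add up the same numbers in a different order can land on slightly different results. A minimal Python sketch (the helper name `sum_in_order` is made up for illustration, and this is a toy, not a model of how any particular GPU or driver schedules a reduction):

```python
import math


def sum_in_order(values):
    """Add floats one at a time, in exactly the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

# Same numbers, two accumulation orders.
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

print(forward)   # 0.6000000000000001
print(backward)  # 0.6

# The results are not bit-identical, but they agree to within rounding error.
print(forward == backward)              # False
print(math.isclose(forward, backward))  # True
```

That is the sense in which a result can be replicated on different hardware: not bit-for-bit, but within a tolerance.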

Data isn't fungible in the same way: I can't just replace one dataset with another for research where the data generation and curation is the primary novel contribution and expect to replicate the results.

There's also a larger accountability picture: just like scientific papers that don't publish data are inherently harder to check for statistical errors or outright fraud, there's a lot of uncomfortable trust required for open-weight closed-data models. How much contamination is there for the major AI benchmarks? How much copyrighted data was used? How can we be sure that the training process was conducted as the authors say, and that nothing went wrong through malfeasance or simple mistakes?

I have very little knowledge of any of this, but I had the impression that OpenAI's models were trained on commodity cloud hardware that's available for purchase or rent to anyone, including off-the-shelf GPUs from Nvidia and AMD. Are those what you are referring to as "the firmware blob", or was there some other, more specialized and custom-built closed hardware involved?
Turing completeness makes it a different problem.
> Do you believe that open source can exist on top of closed hardware?

Yes, if the hardware is developed against standards shared by multiple manufacturers, like amd64.

It's not exactly practical to hand out the training material given the sheer quantity of data we're talking about.
GPL v2 and earlier let you charge distribution costs (v3's language is more complicated). In the late 80s you could order an Emacs tape from the FSF for $150, which is about $430 today!
But they could provide training code and let people provide their own Common Crawl (or whatever other pile of training data), couldn't they?
Yeah, no. We can move an arbitrary amount of data around the world at breakneck speed. Netflix does this for a living. It's not practical to hand out the training material because of the massive rampant copyright violations.
If a research group downloads material in order to train a model, is there some significant difference in copyright violation if they hand it to a second research group in order to fulfill the same purposes?
Yes, because of a key word in a lot of copyright laws... "distribution". Using that copyrighted material themselves to train the model still gives them plausible deniability. Handing the copyrighted material to another group starts to run afoul of other laws and also removes the plausible deniability that the original group can claim regarding their training data.
The training data is not necessarily kept. It's possible that data is consumed, incorporated into the weights and then discarded.
If only we'd figured out a technology that let us move huge torrents of bits around.

If only there was a catchy name for it. Something like bit-torrent perhaps?

I understood that it meant source code in addition to weights, as in "publishing programming language code does not suffice as open source if you do not publish weights"