Hacker News
by TaylorAlexander 931 days ago
Recently there has been a trend of calling models with weights and code available "open source" even if the training data is not available. For safe deployment in health care and other safety-critical fields, transparency about the training data and process is vital too, which means developing clear terminology for models with full transparency! Even this article title suffers from this ambiguity.
4 comments

Yeah it's a pretty obvious misuse of the term. Training data is (part of) the "source"; weights are clearly "binaries". Training is "compiling".
That's why the EU's upcoming AI regulation requires foundational models to have full documentation, including detailed descriptions of training data etc.
I can't fathom why they didn't just require the models to make the training data itself available. Sure, you might need to fork over some cash so they can ship you hard drives, but surely being audited by someone, anyone, is better than not being audited at all.
Training data may be licensed from third parties which don't allow redistribution.
Training data for a medical diagnosis model would likely include enough info to de-anonymize the info for some participants (age, sex, zip code, descriptions). I'm not sure what the answer should be but I'm uncomfortable with the medical training data being provided freely to the world.
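To make the re-identification worry concrete, here is a minimal sketch of a k-anonymity style check over the quasi-identifiers mentioned above (age, sex, zip code). The records, field names, and helper functions are all hypothetical and made up for illustration; a real audit would need far more than this.

```python
from collections import Counter

# Hypothetical quasi-identifier fields; the names are assumptions for this sketch.
QUASI_IDENTIFIERS = ("age", "sex", "zip_code")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def smallest_group_size(records, keys=QUASI_IDENTIFIERS):
    """Return k, the size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(records, keys)
    return min(sizes.values()) if sizes else 0


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Return records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


# Made-up example rows, not real patient data.
records = [
    {"age": 34, "sex": "F", "zip_code": "94110", "description": "case A"},
    {"age": 34, "sex": "F", "zip_code": "94110", "description": "case B"},
    {"age": 71, "sex": "M", "zip_code": "59301", "description": "case C"},
]

print(smallest_group_size(records))
print([r["description"] for r in unique_records(records)])
```

Any record that sits alone in its group is a candidate for re-identification by someone who already knows those three facts about a person, and free-text descriptions only make it easier.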
If you give up your training data, you don’t have a product anymore.
Eventually it will become the norm (as it is around these parts) for the AI to provide sources for its wild claims.

If the answer is "Neck bone connected to the head bone", I'm going to want to see a source that isn't "Dem Bones".

There is still a large difference between an AI being able to cite its claims, and having the original training data and the exact code and process used to convert the training data into the weights used in inference.

If you cannot re-create the weights and model used for inference, a release's value is somewhat limited compared with releases where the inference model can be re-created. (It's kind of like the limited value of scientific papers where the results cannot be reproduced due to a lack of detail.)

What would you do with the training data if you had it? I see absolutely no reason why the training data is needed to evaluate a model, or how any kind of guarantees could be made about the model if you did have the training data. With the weights and code it's perfectly possible to interrogate and evaluate it.

I suspect a lot of people asking for training data are mainly looking to complain about some aspect of it (bias, copyright, etc.) instead of actually thinking they can somehow use it to divine how the model will perform.

One can never practically evaluate it on all possible inputs/prompts, so an understanding of the training data distribution is important to generate the right test queries and create guardrails for desired use cases.
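
As a rough illustration of that point, here is a small sketch that uses the training label distribution to pick out thinly covered categories, which are the ones you would want to test hardest or guard against. The labels, the 5% threshold, and the `underrepresented` helper are assumptions invented for this example.

```python
from collections import Counter


def underrepresented(train_labels, min_share=0.05):
    """Return labels whose share of the training set is below min_share."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    return sorted(label for label, n in counts.items() if n / total < min_share)


# Hypothetical label counts for a medical model's training set.
train_labels = ["cardiology"] * 60 + ["dermatology"] * 38 + ["pediatrics"] * 2

print(underrepresented(train_labels))
```

Note that this only flags categories that are rare in the training data. Categories that are missing entirely never show up in the counts, which is another reason to know what the data was supposed to cover.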