by injidup 56 days ago
You've got it the wrong way round. It's more akin to:

1. Training data is the source. 2. Training is compilation/compression. 3. Weights are the compiled output, akin to optimized assembly.

However, it's an imperfect analogy on many levels. Nitpick away.

1 comment

It's a dataset [0] released under either a source-available license or an OSI-approved license, i.e. an open dataset or an open source dataset.

[0] https://news.ycombinator.com/item?id=47758408

So is it an open dataset or an open source dataset?

E.g. it is no accident that Creative Commons uses different terminology for non-software works.

"Open Source" is normally reserved for OSI-approved licenses, but there are many source-available licenses that are not OSI-approved as well.

For example, gemma4 is released under the Apache 2.0 license, and can be called an open source dataset.

On the other hand, deepseek, for example, while a model with publicly available weights, is not released under an OSI-approved license; they released it under their own "Deepseek License Agreement". In general it's as free to use as under a normal OSI license, but it has some restrictions, e.g. military use is explicitly forbidden.