by godelski 624 days ago
I don't think this makes sense, nor is it consistent with itself, let alone with its other definition[0].

  > The aim of Open Source is not and has never been to enable reproducible software.
  ...
  > Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. 
  ...
  > Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently than its original status. Things that a fork may achieve are: fixing security issues, improving behavior, removing bias.
For these things, it does mean what most people are asking for: training details.

So far companies are just releasing checkpoints and architecture. It is better than nothing and this is a great step (especially with how entrenched businesses are[1]). But if we really want to do things like fixing security issues or removing bias, you have to be able to understand the data that it was originally trained on AND the training procedures. Both of these introduce certain biases (in the statistical sense, which is more general). These issues can't all be solved by tuning, and the ability to tune is itself significantly influenced by these decisions.

We care about reproducible builds because they matter for things like security, where we need to know that what we're looking at is the same thing that's in the actual program. It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source. Trust matters, but the saying is "trust but verify". Sure, you can also fix vulns and bugs in closed source software, hell, you can even edit or build on top of it. But we don't call these things open source (or source available) for a reason.

If we're going to be consistent in our definitions, we need to understand what these things are, at least at a minimal level of abstraction. And frankly, as an ML researcher, I just don't see it.

That said, I'm generally fine with "source available" and, like most people, use it synonymously with "open source". But if you're going to go around telling everyone they're wrong about the OSS definition, at least be consistent and stick to your values.

[0] https://opensource.org/osd

[1] Businesses whose entire model depends on OSS (by OS's definition) and freely available research

1 comments

"Reproducible build" is a term used to refer to getting an exact binary match out of a build. This is outside the scope of the OSD. I am not certain, but it sounds like this is what they are talking about here. Just because you run the build yourself doesn't mean you will get an exact match of what the original producer built. Something as simple as a random number generator or using a timestamp in the build will result in a mismatch.
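
To make the timestamp point concrete, here is a minimal sketch. The `toy_build` function is a hypothetical stand-in for a real compiler, not any actual toolchain: it just prepends a build timestamp to the source. Two builds of identical source at different times then hash differently, while pinning the timestamp (the idea behind conventions like `SOURCE_DATE_EPOCH`) gives a bit-for-bit match.

```python
import hashlib


def toy_build(source: bytes, build_time: int) -> bytes:
    """Hypothetical 'compiler': the artifact is the source plus an embedded timestamp."""
    header = f"built-at:{build_time}\n".encode()
    return header + source


def digest(artifact: bytes) -> str:
    """SHA-256 of the artifact, used here as the 'exact binary match' check."""
    return hashlib.sha256(artifact).hexdigest()


source = b"print('hello')\n"

# Same source, built a minute apart: the embedded timestamp changes the artifact.
first = toy_build(source, build_time=1_700_000_000)
second = toy_build(source, build_time=1_700_000_060)
timestamps_differ = digest(first) != digest(second)

# Same source, timestamp pinned to a fixed value: the artifacts are identical.
pinned_a = toy_build(source, build_time=0)
pinned_b = toy_build(source, build_time=0)
pinned_match = digest(pinned_a) == digest(pinned_b)

print(timestamps_differ, pinned_match)
```

Nothing about the source changed between the first two builds, so having the source is not the same thing as being able to reproduce the binary.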

  > "Reproducible build" is a term used to refer to getting an exact binary match out of a build.
I'm not sure what makes you think I failed to understand this. Allow me to quote myself:

  >> It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source.
But also, my entire point is not really about the reproducible build aspect. It is that if we're going to draw an analogy then the training and data IS the source. At worst, we'd say it is the build instructions.

But maybe I don't understand Open Source. Is it still Open Source if I provide the source code and an Apache License, but the code is in my own custom language (for fun, let's say it reads like brainfuck) and I have not released the compiler? Maybe some people would call this Open Source, but I imagine it would ruffle a lot of feathers. Is there really a meaningful difference between that and a binary? If it does fit "the letter of the law" it most certainly does not fit "the spirit of the law". It is the spirit of the law that matters, because it is the whole fucking point.