by NitpickLawyer 634 days ago
> Wow, an actual open source language model

I find it funny that the AI field has somehow normalised the goalpost moving from capabilities all the way to definitions about open source. And people seem really tribal about it...

There absolutely are open source LLMs already: Phi-3.5 (MIT), various Mistral models (Apache 2.0), various Qwen2 models (Apache 2.0), and so on. The Llamas are not open source, nor are the Gemmas. But to say this is "an actual open source model" is weird nitpicking for the sake of nitpicking, IMO.

Requiring the methods and datasets that someone used to create some piece of IP is in no way a requirement for open sourcing said IP. It never has been!

Imagine this analogy:

A dev comes up with a way to generate source code that solves a real problem. This dev uses a secret seed that only they know. The dev also uses thousands of hours of compute, and an algorithm that they created. At the end of the exercise they release the results on GitHub, as follows:

- here is a project that takes in a piece of text in english, and translates it into french.

- the resulting source code is massive: 10 billion LOC. The lines of code are just if statements, all the way down, with some hardcoded integer values.

- source code licensed under Apache 2.0, written in, let's say, Python.

- users can see the source code

- users can run the source code

- users can modify the source code and re-release the code

Now, would anyone pre-LLMs say "this isn't true open source" because it's too complicated? Because no one can reasonably understand the source code? Because it uses hardcoded int values? Because it's 10B LOC? Because the dev never shared how they got those values?

Of course not. The resulting code would have been open source because Apache 2.0 is open source.

It's the same with model weights. Just because they're not source code, and just because you don't know how they were created, it does not mean the weights are not open source.

You can see the weights. You can change the weights. You can re-distribute the weights. It's open source. The definition of something being open source does not cover you understanding why the weights are like they are. Nor does it require you to have access to the methods of creating those weights. Or datasets. Or whatever the devs had for breakfast.


Great, with that definition we can call all binaries open source!

This is the greatest misconception in this field. Weights are not a binary form! In fact you can't "run" the weights as they are. They only represent some fixed values.

Whenever you use an LLM you "load" the weights, using (usually open source) code and you run inference with that code. The weights are not binary and the analogy to the binary form of distributing software is not valid, IMO.
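To make the "load, then run" point concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical (the JSON layout, `load_weights`, `forward`, a single dense layer with ReLU); real models use tensor file formats and frameworks. The only thing it is meant to show is the division of labour: the weights are inert numbers, and separate code loads them and does the computing.

```python
import json

# Hypothetical "released weights": plain numbers, nothing executable.
WEIGHTS_JSON = '{"w": [[1.0, -1.0], [0.5, 0.5]], "b": [0.0, 1.0]}'


def load_weights(text):
    """Parse the released values into ordinary Python lists."""
    return json.loads(text)


def forward(weights, x):
    """Run one dense layer followed by ReLU over the input vector x."""
    out = []
    for row, bias in zip(weights["w"], weights["b"]):
        z = sum(wi * xi for wi, xi in zip(row, x)) + bias
        out.append(max(0.0, z))
    return out


weights = load_weights(WEIGHTS_JSON)
print(forward(weights, [2.0, 1.0]))  # [1.0, 2.5]

# The values can be inspected and edited like any other data.
weights["b"][0] = -5.0
print(forward(weights, [2.0, 1.0]))  # [0.0, 2.5]
```

Whether that makes the weights "source" is exactly what the thread is arguing about; the sketch only shows that they are data consumed by code, not something you execute.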

That is why I used the analogy of Python code with ifs all the way down, based on hardcoded values. That is what you are arguing is not open source. The weights are just "hardcoded values".

Open source never had the requirement of the author explaining what, why or how they got a hardcoded value in their shared code. Why it suddenly does for LLMs is what I find funny.

By that argument, all bytecode is open source, because it has to be run in some other environment, and you can technically modify it if you want to. Open source is supposed to refer to the human-interpretable elements of the code. E.g., kernel modules that are technically formatted as C code but contain non-human readable firmware as values are still considered "binary blobs" and not part of the free/open source kernels some distros ship.
I completely disagree with you. The fundamental problem with your concept of open source is it goes against what open source really is. The ability for you to completely change what a piece of software can do. IMO, even with LLMs, models are "executables" and weights are "configuration". Yes, of course you can tune the weights by changing the values, but that's the most I can do. Can I actually add "features" to the model? Perhaps you "open-sourced" an LLM model trained on the United States Constitution. Can I change the model to then be a specialist in real estate law? Not with weights. I need it to learn case histories to extend its "feature-set". Without data and the mechanism to reproduce the model, how is this "open-source"?
> Can I actually add "features" to the model?

Yes. You can use a number of libraries to add, mix, merge, etc. layers [1]

> Not with weights. I need it to learn case histories to extend its "feature-set".

Again, yes. You can add attention heads, other features, heck you can even add classification if you want [2]. Because you are working with an open architecture! Weights are not the binary blobs you think they are. That is a common misconception.

[1] - https://github.com/arcee-ai/mergekit

[2] - https://github.com/center-for-humans-and-machines/transforme...
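As a rough illustration of what "merge" and "add a head" can mean at the level of raw values, here is a small sketch. It is not mergekit's API or any library's API: `merge_linear`, `add_head`, and the dict-of-lists model format are all made up for the example, and real merges operate on full tensors with far more care.

```python
def merge_linear(a, b, alpha=0.5):
    """Interpolate two models tensor by tensor: alpha * a + (1 - alpha) * b."""
    if a.keys() != b.keys():
        raise ValueError("models must share the same tensor names")
    merged = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch in {name}")
        merged[name] = [
            alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])
        ]
    return merged


def add_head(model, name, values):
    """Return a copy of the model with one extra named tensor attached."""
    if name in model:
        raise ValueError(f"{name} already exists")
    extended = dict(model)
    extended[name] = list(values)
    return extended


base = {"layer0": [1.0, 3.0]}
other = {"layer0": [3.0, 5.0]}
merged = merge_linear(base, other)
print(merged)  # {'layer0': [2.0, 4.0]}
print(sorted(add_head(merged, "classifier", [0.1, 0.2])))  # ['classifier', 'layer0']
```

A newly added head starts out with whatever values you give it, so in practice it still has to be trained on some data before it is useful.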

At first glance, that just seems like a bunch of libraries linked together to form a binary. That is not open-source. I completely agree with you that there is just not enough clarity out there. For my education, following up with my earlier example, can I remove the layers that have references to all chapters / laws in the constitution except for the ones meant for real-estate? How would I do that with the approaches you mentioned here?

Fundamentally, if I have to "reverse-engineer" something, then it's not open-source.

You would have to do the same fine-tuning as if you had the training data.
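For anyone unsure what fine-tuning means mechanically, a toy sketch: start from the released values and nudge them with gradient steps on your own data. The model here is a hypothetical one-weight linear function and `fine_tune` is an illustrative name, nothing like a real LLM training loop, but the starting point is the same: the published weights, not the original dataset.

```python
def fine_tune(w, b, data, lr=0.1, epochs=200):
    """Plain SGD on squared error for the model y = w * x + b."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


# "Released" weights: w=1.0, b=0.0. New data follows y = 2x + 1.
new_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = fine_tune(1.0, 0.0, new_data)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```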
> that the AI field has somehow normalised the goalpost moving from capabilities all the way to definitions about open source

The problem is that Facebook and others are trying to move the goalpost, while others like me would like the goalpost to remain where it is: we call projects "Open source" when the parts required to build them on our own machines are sufficiently accessible.

As I probably wouldn't be a developer in the first place if it weren't for FOSS, and I spend literally all day long contributing to others' FOSS projects and working on my own, it's kind of scary seeing these large companies trying to change what FOSS means.

I think you're forgetting about the intent and purpose of open source. The goal is that people can run software for whatever purpose they want, and they can modify it for whatever purpose. This is the intent behind the licenses we use when we "create FOSS".

This means, in practice, that the source code has to be accessible somehow, so the compiler I have on my computer can build a similar binary to the one the project itself offers (if it does). The source code has to be accessible so I can build the project, but also modify it for myself.

Taking this idea that mostly only applied to software before (FOSS) but applying it to ML instead, it's clear to see what we need in order to 1) be able to use it as we want and 2) be able to modify it as we want.

> You can see the weights. You can change the weights. You can re-distribute the weights. It's open source.

Right. If I upload a binary to some website, you can see the binary, you can change the binary and you can re-distribute it. Would you say the binary is open source?

The weights are the binary in ML contexts. It's OK for projects to publish those weights, but it's not OK to suddenly change the definition and meaning of open source because companies want to look like they're doing FOSS, when in reality they're publishing binaries without any ways of building those binaries with your own changes.

Imagine if the Linux kernel was just a big binary blob. Yes, you can change it, redistribute it and whatnot, but only in binary-blob shape. You'd be kind of out there if you insisted on calling this binary-blob kernel FOSS. I'm sure you'd be able to convince some Facebook engineers about it, seems they're rolling with that idea already, but the rest of us who exist in the FOSS ecosystem? We'd still have the same goalpost in the exact same spot it's been for at least the two decades I've been involved.

> Would you say the binary is open source?

Great question. Is the assembly code in a git repo, with an open source license? Then yes! It's open source!

Think about it this way: just because someone wrote hello world in c and then a compiler translated that into assembly, doesn't invalidate the quality of that assembly code being open source! That's the point. Something is open source or not if the resulting stuff is published under an open source license. Can you see the assembly code? Can you change it? Can you re-publish it? If all of these are yes, then it's open source!

> Imagine if the Linux kernel ...

That is semantics. The linux kernel is published in c because it's easier for people to reason in that abstracted language, but it would not suddenly become "closed source" if it were written in asm, assuming it would still be published under an open source license.

In other words, you having access to the "dataset" would not make the weights any easier to work with. They would still be in a "blob" as you call it.

> Think about it this way: just because someone wrote hello world in c and then a compiler translated that into assembly, doesn't invalidate the quality of that assembly code being open source!

Meanwhile:

> The source code must be the preferred form in which a programmer would modify the program.

https://opensource.org/osd

Then, given the fact that both you and Mistral AI modify the program in the exact same way, that portion still holds.

People view weights as an intended obfuscation by the party releasing them. They are not! In fact, it is just as hard for them to "understand" why a certain value at a certain index is what it is as it is for you! Just ask Anthropic. They are also doing "poke this weight, see what pops" with their own models.

Again, that is why I used the analogy above. You are arguing that if someone uses a hardcoded value in their code, and won't share how they derived that value, it somehow isn't open source. That, IMO, is wrong.

> Again, that is why I used the analogy above. You are arguing that if someone uses a hardcoded value in their code, and won't share how they derived that value, it somehow isn't open source. That, IMO, is wrong.

It feels like you deliberately ignore the source part of "open source". If you have X that produces Y, then X is the source, Y is the output. You cannot "open source" Y as Y isn't a source to anything, it's the output from the source. The only part you can "open source" is the source part, which is X in this case.

Interesting points, regardless of how much I disagree with them, so thank you for sharing your views :)

> Think about it this way: just because someone wrote hello world in c and then a compiler translated that into assembly

I understand your point: since it's technically assembly, you could license that assembly under a FOSS license and now the thing you distributed is "open". I agree you could do this, but you shouldn't use "open source" to describe what you're doing there, unless the actual source for building that asm is open too. The binary might be available, but "open source" is something that applies to source code, not to what we distribute. If your source is C and your output is assembly, but you only apply a FOSS license to the output, not the source, it'd be a hard sell to claim the source is open and available.

The closest I've come to finding some sort of backing for this view I hold is what the OSI echoes here:

> What if I do not want to distribute my program in source code form? Or what if I don’t want to distribute it in either source or binary form?

> If you don’t distribute source code, then what you are distributing cannot meaningfully be called “Open Source”. And if you don’t distribute at all, then by definition you’re not distributing source code, so you’re not distributing anything Open Source. [...] Open Source licenses are always applied to the source code — so if you’re not distributing the source, then you’re not distributing the thing to which an Open Source license applies

https://opensource.org/faq#non-distribution

Similarly, I wouldn't call a song I release as "open source" (not that it makes much sense in this case) unless the actual "source" of how it was produced is public under a FOSS license, even if you can technically read the sound data however you want, and modify it by patching the audio file. Instead, some other liberal license is more suitable that allows using/modifying/redistributing the output however you want (Creative Commons is common for those use cases), but not a license that is specifically about source code.

> That is semantics. The linux kernel is published in c because it's easier for people to reason in that abstracted language, but it would not suddenly become "closed source" if it were written in asm, assuming it would still be published under an open source license.

I agree with this too: if suddenly the kernel was written in asm, and it's being distributed as asm, then you can license that asm as "open source" and that'd be OK. What wouldn't be "open source" would be if it's written in C, but that C code isn't licensed "open source", and the authors try to argue that the compiled asm output is "open source". It's output, not source, so you cannot license the output as "open source" as it's missing that last part, the "source".

> In other words, you having access to the "dataset" would not make the weights any easier to work with. They would still be in a "blob" as you call it.

Precisely. So the requirements end up something like: Can I build this thing from scratch myself, granted I have the required equipment + knowledge + time?

For LLMs, at least the training script + the dataset have to be available without restrictions for that to be possible. If they're not available, or available but under restrictions (usage or otherwise), then it's not open source.

Haha, having lengthy discussions, especially when we disagree, is healthy IMO. That's how we get to experience other viewpoints, and hopefully become better for the effort.

> Can I build this thing from scratch myself

You absolutely can. Everything you need is in the model config (layers, stuff) and there are training scripts all over the net. Now, granted, you will not necessarily get the same results, but then again neither would Mistral or Meta.
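A toy sketch of that claim: the config pins down the shapes, while the starting values depend on a random seed, so two from-scratch runs begin (and generally end) with different numbers. `CONFIG`, its keys, and `init_from_config` are hypothetical stand-ins, not any real model's config schema, and the sketch covers only initialisation, not training or data.

```python
import random

# Hypothetical config: just enough to determine tensor shapes.
CONFIG = {"hidden_size": 4, "num_layers": 2}


def init_from_config(config, seed):
    """Build num_layers square weight matrices with seeded random values."""
    rng = random.Random(seed)
    n = config["hidden_size"]
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(n)] for _ in range(n)]
        for _ in range(config["num_layers"])
    ]


run_a = init_from_config(CONFIG, seed=0)
run_b = init_from_config(CONFIG, seed=1)

print(len(run_a), len(run_a[0]), len(run_a[0][0]))  # 2 4 4
print(run_a == run_b)  # False: same architecture, different numbers
print(run_a == init_from_config(CONFIG, seed=0))  # True: same seed reproduces
```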

> but the authors tries to argue that the compiled asm output is "open source". It's output, not source, so you cannot license the output as "open source" as it's missing that last part, the "source".

Replying here because I can't in the other subthread. I think you are working from a misconception about what source code is and what a weight is. In the LLM world, you already have the source code for inference. This would be either PyTorch or C code or whatever. You also have the architecture code. You can see what the model looks like, what layers it has, what ops it does to reach a result. That is also open! So you get the source to run inference. You get the source to "load" the model (i.e. the architecture, layers, etc). And you get a bunch of hardcoded values.

What you don't get is the why behind the question "why is this value x and not y". And for the most part, no one knows.

> If they're not available, or available but under restrictions (usage or otherwise), then it's not open source.

Let's take another (famous) example. Quake III is famous for a hardcoded magic constant (0x5f3759df) in its fast inverse square root routine, which speeds up the vector normalisation behind lighting and geometry computations. Now, you can change that value, but things will be messed up in the engine. Collisions will happen weirdly, things will look bad. Now, is Quake any less "open source" if you or I don't understand why the original coder chose that value? Of course not! Well, now just multiply that by 1B hardcoded values. It's the exact same thing. You could change any of the values, but the game would look wonky as you do so. But, at the end of the day, it would not be any less open source.
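Assuming the example refers to the fast inverse square root from the Quake III Arena source and its constant 0x5f3759df, here is a Python port of the idea. The original is C working on 32-bit floats; this sketch emulates the bit reinterpretation with `struct`, and the `magic` parameter is added purely so the constant can be swapped out to see what breaks.

```python
import struct

MAGIC = 0x5F3759DF  # the hardcoded constant from the Quake III source


def fast_inv_sqrt(x, magic=MAGIC):
    """Approximate 1/sqrt(x) for positive x using the bit-level trick."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The magic step: shift, then subtract from the constant.
    i = (magic - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer bits as a float32 again: a rough first guess.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration to refine the guess.
    return y * (1.5 - 0.5 * x * y * y)


good = fast_inv_sqrt(4.0)             # close to the exact answer 0.5
bad = fast_inv_sqrt(4.0, 0x5F000000)  # arbitrary replacement constant
print(abs(good - 0.5) < 0.001)  # True
print(abs(bad - 0.5) > 0.04)    # True: noticeably worse
```

The code still runs with a different constant; it just gets less accurate, which is the "things look wonky" effect described above in miniature.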

I guess what I'm trying to say is that weights are not binary blobs. Weights are not an obfuscation attempt. Weights are distributed exactly how they are intended to be used, and how they are being used by the creators as well. You can change the architecture of a model (see above for details). You can add layers, you can remove layers. You can perform "abliterations", or you can do fine-tuning. Everything is exactly done as the "creators" intended. The only thing you don't have is "how they got those exact same numbers". But you don't need that. And it's funny that somehow for LLMs that's a bridge too far. It never used to be for any other project.