by version_five 999 days ago
They didn't mention llama.cpp or show it in their picture. Hopefully that's an oversight, because it feels like a major slight. It's a (the?) major reason for llama's popularity.

I have mixed feelings: llama is great, but it has perpetuated its shitty license. They could have done so much more good if they'd used GPL-style licensing; instead they basically subverted open source, using an objectively good model as leverage.


A lot of the time you can feel wronged without it being intentional. In this case I think the mention of AWS being a partner shows intent to put value behind what they are doing for their stakeholders.

The license for Llama 2 is pretty intense, but it mirrors that intent by limiting interactions with individuals at scale, as well as barring anything learned from the model through inference from being used to train another model. I suspect this is because the dataset on which it was trained is the company's IP, which again is for the shareholders' benefit.

The code is open though, I think out of necessity. AI poses a significant challenge for our survival, and making it open is an indication of transparency. They still need to make money at what they do and charge people for using their IP, within reason.

I guess my question would be: if I used Llama (not the code, but the model itself) to code up a new model, would that be a derivative work?

Surely it's IP the shareholders have licensed, rather than their own IP.

Aka, my own comments being sublicensed back to me, after I licenced them to Facebook.

> It's a (the?) major reason for llamas popularity.

Absolutely not. There's a corner of the overall community that hovers around it and overestimates it, assuming everyone else uses only it too.

It's great if you have an Apple ARM machine and want to see an M2 Pro do 10 tokens/sec (and see what could give an Apple ARM machine a 30-minute battery life).

I also doubt it's a slight: the only callouts are large commercial collaborations, e.g. NVIDIA, AMD, and Google, each representative of one of the 3 groups we could assign it to.

I'd be curious if you have any hard data about use. Mine is anecdotal too, but I see that llama.cpp is the very close second-highest-starred repo with llama in the name, after meta llama. Additionally, all the HF models seem to have ggml / gguf quantized versions. I'm not aware of a competing format for quantized models. There are also Python bindings, which are used in a lot of projects. What is a competing framework, other than PyTorch, that's getting more use? Or is it all just PyTorch (and some HF wrappers) and the rest is a rounding error?

This reminds me of a comment elsewhere I also replied to today: it's sort of hard to even pretend I have global usage stats, so I won't.

There's a certain type of myopia that leads to overindexing on llama.cpp, and it's easy to classify. To wit:

> not aware of a competing format for quantized models

ONNX. That's how it's done in prod, and on other models besides (and including) LLaMa. Quantization is a general technique. 100 small variants of llama2 GGML weights feels like spam from that perspective (sort of Civitai vs. Hugging Face; Hugging Face smartly stopped that with AI art).

llm.mlc.ai for a more academic / less ad-hoc approach.

> [stars on github]

It's great for a very narrow & simple case that matches a large demographic on GitHub, and the demographics of people talking LLMs casually on HN: MacBook, wanna run locally, and dream of a future free of having to ship your data to servers to get personalization. 5% of overall usage can be #2 in usage, if that makes sense.

> done in prod ... Hugging Face smartly stopped that with AI art ... more academic

Most human people doing LLM at home aren't interested in cargo culting the for-profit corporate and institutional stuff, since their resources and incentives are so different from human beings' incentives. As there are more humans than corporations or institutions and they tend to talk more, what they use tends to be better known than the stuff optimized for making a profit and serving business needs with business culture.

> This reminds me of a comment elsewhere I also replied to today

Right, looks like you made fun of / were condescendingly dismissive of my comment in another thread, I wouldn't have replied here if I'd realized you were the same person.

LOL I was thinking of an entirely different comment on another site. Give me credit here, I never cast aspersions on you, or even addressed you directly here.

I apologize for making you feel condescended to, but also would like to point out the _mean_ comment is +7, much less this one: there's a pretty significant gap in your knowledge and reality is going to keep intruding. Engaging in public is a wonderful way to learn, but you're coming across as glib and assertive and uninformed. You thought llama.cpp invented quantization and there's no other real format? :X

The “original” and by far most common format for quantization is GPTQ.

AWQ support is spreading more, which is nice.

Again, for a subset of the local LLM community. Quantization was not invented on GitHub, by llama.cpp, for LLMs in 2023.

If a tree falls in a forest and no one is around, does it make a sound?

Of course quantization was invented well before LLMs. However, LLMs have dramatically accelerated development on quantization and resulted in an explosion in use.
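
For anyone following along, here's a minimal sketch of what "quantization" means at its simplest, assuming plain symmetric int8 rounding with one scale for the whole tensor. The formats named in this thread use more elaborate schemes, so treat this as an illustration only; `quantize_int8` and `dequantize` are made-up names, not any library's API.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale; fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]  # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored as a small integer plus one shared float, and the round trip is off by at most half a step of `scale` per weight. That trade of precision for size is the general technique, whatever the file format.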

fwiw I get more like 35-40 tokens/sec on my M1 MacBook with a 7B model. That's way faster than I can read or skim. If we can figure out how to focus the expertise in small models, I don't see why it wouldn't be viable for those of us who don't want to share all of our convos with big tech.