HN Mirror

by behnamoh 847 days ago

Gemma (and Gemini) are heavily nerfed. Why are they on the news lately?

Also, Gemma is a +9B model. I think it's not okay that Google compared it with Mistral and Llama 2 (7B) models.

Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

All this hype seems to be backed by Google to boost their models whereas in practice, the models are not that good.

Google also made a big claim about Gemini 1.5 1M context window, but at the end of their article they said they'll limit it to 128K. So all that 1M flex was for nothing?

Not to mention their absurd approach to alignment in image generation.

8 comments

gliptic 847 days ago

> Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

Are you talking about gemma.cpp? Then no, they didn't.

andy99 847 days ago

Presumably he means this https://cloud.google.com/blog/products/application-developme...

The claim is correct but not related to gemma

gliptic 847 days ago

From the repo:

> Run LLMs locally on Cloud Workstations. Uses:

> Quantized models from [Huggingface]

> llama-cpp-python's webserver

But sure, the blog post doesn't mention it.

behnamoh 847 days ago

Yes, I was talking about this.

sivakon 847 days ago

It’s objectively worse in my local tests compared to Mistral. Again, their model doesn’t include the MT-bench benchmark because it’s really, really bad at answering follow-up questions (this is also a problem in Ultra). Its reasoning is also pretty bad compared to Mistral.

b33j0r 847 days ago

I can’t get it to recognize the stop token consistently in the 7b models.

About 50% of the shots, I get a sentence and a half of beautiful poetry, then a codeswitch into kanji, and then ral ral ral ral ral 膳 ral 杯 ral ral

Until I kill the process. Not every time, but way more often than the other llamas (which is basically never, these days).

I think they underestimated the impact of training on bulleted lists. It seems to love those!

rasbt 847 days ago

> Gemma is a +9B model

Yes that's correct. It's 9.3B parameters if you count the embedding layer and final projection layer separately. However, since they used weight tying, the adjusted count is 8.5B as discussed in the article.

neodymiumphish 847 days ago

Which still rounds to 9B and is 21.4% larger.
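
The rounding arithmetic in this exchange can be checked with a short sketch. It uses only figures quoted in the thread (8.5B, 6.6B, and the 7,751,248,896 non-embedding count); the helper names are made up for illustration. Note that Python's built-in `round()` rounds ties to even, so the sketch uses an explicit round-half-up helper instead.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with ties going up.

    Python's built-in round() uses round-half-to-even, so round(8.5) gives 8.
    """
    return math.floor(x + 0.5)


def percent_larger(actual: float, nominal: float) -> float:
    """How much larger `actual` is than `nominal`, in percent."""
    return (actual / nominal - 1.0) * 100.0


# Sizes in billions of parameters, as quoted in the thread.
sizes_in_billions = {
    "Gemma, weight-tied": 8.5,
    "Gemma, non-embedding only": 7.751248896,
    "Llama 2": 6.6,
}

for name, size in sizes_in_billions.items():
    print(f"{name}: {size}B rounds to {round_half_up(size)}B")

print(f"8.5B is {percent_larger(8.5, 7.0):.1f}% larger than 7B")
```

Under nearest-B rounding, none of the Gemma figures quoted here lands on 7B, while Llama 2's 6.6B does.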

rasbt 847 days ago

Yes, it's definitely unfair to count it as a 7B model. In that case, we could call Llama 2, which is 6.6B parameters, a 6B (or even 5B) parameter model.

neodymiumphish 840 days ago

Except 6.6 rounds to 7. That’s completely reasonable. Arguing otherwise is pedantic.

htrp 847 days ago

> Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

They said it was inspired by llama

> This is inspired by vertically-integrated model implementations such as ggml, llama.c, and llama.rs.

from https://github.com/google/gemma.cpp

pests 846 days ago

Not gemma.cpp

He meant this:

https://cloud.google.com/blog/products/application-developme...

brucethemoose2 847 days ago

Counterpoints:

- Local models are pretty easy to de-censor, if that's what you mean.

- ...Yeah, it should not be labeled as a 7B. It's sort of 7B-class.

- The repo mentions they use the llama-cpp-python server.

- 1M context brute-forced across TPUs is insanely expensive; I can see why Google reined it in.

But overall your message is not wrong. Google is hyping Gemma a ton when it's... well, not very remarkable. And they could certainly have made something niche and interesting, like a long-context 8.5B model, a specialized model, or a vastly more multilingual model: something to differentiate it from Mistral 7B 0.2.

d-z-m 847 days ago

> Also, Gemma is a +9B model. I think it's not okay that Google compared it with Mistral and Llama 2 (7B) models.

They say it's because they're not counting embedding parameters[0]. Although apparently even with the embedding parameters subtracted, it still rounds to 8B, not 7B. From what I understand, rounding to the nearest B is the standard. It seems slightly disingenuous to call it 7B, but not a big deal IMO, since I don't hear anyone saying this model is outperforming popular OSS 7Bs.

[0]: https://huggingface.co/google/gemma-7b/discussions/34

andy99 847 days ago

Gemma has a 7B-parameter model, https://huggingface.co/google/gemma-7b, and that's what I saw compared to Mistral.

(Edit: I'm wrong)

light_hue_1 847 days ago

No it doesn't.

Gemma 7B is a 9B model. The name is a lie. Then they really played games with Gemma 2B as well.

I don't get how Google can be this incompetent and far behind everyone else. They have amazing people and the kinds of resources that almost no one else does but somehow need to resort to faking demos, blatant lies about model sizes, etc.

Google used to be the place everyone wanted to go. Someone at Google AI needs to be fired so they can start being productive again.

xfalcox 847 days ago

> Gemma 7B is a 9B model. The name is a lie

Ohhh so that explains why I couldn't load it on my RTX 4090, while other 7B models load just fine!

andy99 847 days ago

Haha really? I almost added the caveat that I didn't count the parameters myself. And I couldn't see the weights file size because it requires login (because of their restrictive licensing choice). If true and it's 9B that's really dishonest.

rasbt 847 days ago

Yes, it's 8.5B params if you account for weight tying, and 9.3B if you count the embedding layer and output layer weights separately as shown in the 2nd figure in the article. In the paper, I think they justified 7B by only counting the non-embedding parameters (7,751,248,896), which is kind of cheating in my opinion, because if you do that, then Llama 2 is basically a 5B-6B param model.
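
A minimal sketch of how the three counts relate, assuming a vocabulary of 256,000 tokens and a hidden size of 3,072. Neither value is stated in this thread, so treat both as assumptions; only the non-embedding count is quoted from the comment above. With weight tying, the embedding matrix is reused as the output projection and counted once; without it, the two are counted separately.

```python
# Quoted in the comment above.
NON_EMBEDDING_PARAMS = 7_751_248_896

# Assumptions for illustration: not stated anywhere in this thread.
VOCAB_SIZE = 256_000
HIDDEN_SIZE = 3_072


def count_params(non_embedding: int, vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Total parameters given the non-embedding count and the embedding shape.

    With weight tying, the output projection shares the embedding matrix,
    so that matrix is counted only once.
    """
    embedding = vocab_size * hidden_size
    output_projection = 0 if tied else vocab_size * hidden_size
    return non_embedding + embedding + output_projection


tied = count_params(NON_EMBEDDING_PARAMS, VOCAB_SIZE, HIDDEN_SIZE, tied=True)
untied = count_params(NON_EMBEDDING_PARAMS, VOCAB_SIZE, HIDDEN_SIZE, tied=False)

print(f"non-embedding only: {NON_EMBEDDING_PARAMS / 1e9:.1f}B")
print(f"with weight tying:  {tied / 1e9:.1f}B")
print(f"counted separately: {untied / 1e9:.1f}B")
```

Under these assumed shapes the totals come out to roughly 8.5B and 9.3B, in line with the figures given above.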

andy99 847 days ago

Is the 2B measured like that as well? I did use it with llama.cpp and noticed it ran slower than I expected.

That's the danger of too much abstraction, it's easy to have big gaps in one's understanding of what's really going on.

rasbt 847 days ago

Yes, it's somewhat similar for the 2B model, as it uses the same vocabulary size.

brucethemoose2 847 days ago

Practically speaking, I OOM'd running Gemma on a 3090 using a config that had VRAM to spare for Mistral 7B. It kinda surprised me at first, until I realized why.
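
A rough sketch of why the extra parameters matter for VRAM, assuming 16-bit weights (2 bytes per parameter) and counting weights only. The function name is hypothetical, and the estimate deliberately ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed for the weights alone, in GiB.

    Assumes every parameter is stored at `bytes_per_param` bytes
    (2.0 corresponds to fp16/bf16). Excludes KV cache and activations.
    """
    return params_billions * 1e9 * bytes_per_param / 2**30


nominal_7b = weight_memory_gib(7.0)   # what a "7B" label suggests
gemma_tied = weight_memory_gib(8.5)   # weight-tied count quoted in the thread

print(f"nominal 7B weights: {nominal_7b:.2f} GiB")
print(f"8.5B weights:       {gemma_tied:.2f} GiB")
print(f"difference:         {gemma_tied - nominal_7b:.2f} GiB")
```

On a 24 GB card, a gap of a few GiB in weights alone can be the difference between a config that fits and one that runs out of memory.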

cyanydeez 847 days ago

the context window is entirely limited by VRAM size

do you even LLM?
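
To illustrate how memory scales with context length, here is a sketch of a KV-cache size estimate for a standard transformer decoder. The layer count, KV-head count, and head dimension are hypothetical values chosen for illustration, not the configuration of Gemma or Gemini, and the cache is assumed to hold 16-bit values.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,        # hypothetical
    n_kv_heads: int = 8,       # hypothetical
    head_dim: int = 128,       # hypothetical
    bytes_per_value: int = 2,  # assumes fp16/bf16 cache entries
) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


GIB = 2**30
for tokens in (128 * 1024, 1024 * 1024):
    print(f"{tokens:>9} tokens: {kv_cache_bytes(tokens) / GIB:.0f} GiB of KV cache")
```

The cache grows linearly with sequence length, so going from 128K to 1M tokens multiplies its size by eight under any fixed configuration.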
