

by minimaxir 6 days ago

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B at 6.7GB. That does fit comfortably within the 16GB Google claims, although it confirms that only the quantized version will do so.
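
For anyone who wants to sanity-check the 6.7GB figure, here is a rough back-of-the-envelope sketch. It assumes exactly 12e9 parameters (the real count is not stated here) and uses the GGUF Q4_0 layout of 32 weights per block with one fp16 scale, which works out to 4.5 bits per weight. It ignores KV cache, activations, and any tensors kept at higher precision, so treat it as an estimate only.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: exactly 12e9 parameters; the actual model may have more or fewer.
N_PARAMS = 12e9

# GGUF Q4_0 packs 32 weights into 16 bytes of 4-bit values plus a 2-byte fp16 scale.
Q4_0_BITS = (16 + 2) * 8 / 32  # 4.5 bits per weight

bf16_gb = weight_memory_gb(N_PARAMS, 16)
q4_0_gb = weight_memory_gb(N_PARAMS, Q4_0_BITS)
print(f"BF16: {bf16_gb:.2f} GB, Q4_0: {q4_0_gb:.2f} GB")
```

Under those assumptions this prints `BF16: 24.00 GB, Q4_0: 6.75 GB`, which is in the same ballpark as the quoted 6.7GB and shows why the unquantized weights alone would not fit in 16GB.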

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to insufficient RAM, even on a 16GB machine. Given the expected VRAM usage here, the Q4_0 variant definitely should fit, and Google should fix that.

2 comments

Aurornis 6 days ago

I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.

The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.
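
To illustrate the distinction: the rounding step below is what a plain post-training quantization applies once to finished weights. In quantization-aware training, a step like this runs in the forward pass during training so the weights can adapt to the rounding error. This is a simplified symmetric 4-bit sketch, not the exact Q4_0 format and not Google's training recipe; `fake_quantize_block` is a hypothetical helper name.

```python
def fake_quantize_block(block, bits=4):
    """Round-trip a block of floats through a symmetric integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)  # all-zero block: nothing to quantize
    scale = amax / qmax
    # Quantize to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in block]


# Illustrative values only, not real model weights.
weights = [0.7, -0.33, 0.12, 0.02]
dequantized = fake_quantize_block(weights)
max_error = max(abs(w - q) for w, q in zip(weights, dequantized))
print(dequantized, max_error)
```

The per-weight error is bounded by half the block scale. A QAT checkpoint has been trained with that error already present, which is why it is not interchangeable with a simple quantization of the original weights.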


netdur 6 days ago

Not sure if I understand you, but Q4 and QAT Q4 are different.


refulgentis 6 days ago

It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?

- Gemma 4 2B/4B/27BE3B/31B

- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)

- Gemma 4 12B (2 days ago? 1?)

- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)

It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.

Extremely glad for the output, not glad to have to chase it.

E.g., llama.cpp currently supports the originals but not the MTP predictors. There is a patch for the MTP predictors, but not for the small MoE models. I think it supports the 12B, but maybe not media for it yet. Now we have these too, and the blog says there are GGUFs (llama.cpp models), but there aren't any in the 12? repos I clicked through. And ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.

Also, if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021).

EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!


ddarolfi 6 days ago

These models aren't products? They are open-source-ish (open-weight, I guess) research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.


sumedh 6 days ago

> I believe it's on you to understand it.

This is exactly why Google has 10 messenger apps.


nolist_policy 5 days ago

Google released their latest messenger app 9 years ago. https://en.wikipedia.org/wiki/Google_Chat


refulgentis 6 days ago

I understand it. :)

And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)


overfeed 6 days ago

You can stop after the first one. Choosing to repeat the process is on you, and probably because you see some benefit in using the variant(s) you build on top of.


ddarolfi 6 days ago

Yes, my framing was a little confusing. You were clear that you are building products on them. My point was more that because these Gemma models are research outputs rather than products, the naming scheme should be scientific rather than consumer-friendly.


satvikpendem 6 days ago

Just use Unsloth Studio; it supports them all.
