
by latchkey 151 days ago

https://huggingface.co/unsloth/GLM-4.7-GGUF

This user has also done a bunch of good quants:

https://huggingface.co/0xSero

2 comments

WanderPanda 151 days ago

I find it hard to trust post-training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out, because automatically running a suite of benchmarks should be the easiest thing to do.


Miraste 151 days ago

Unsloth doesn't seem to do this for every new model, but they did publish a report on their quant methods and the performance loss they cause.

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

The loss isn't much until you get down to very small quants.


dajonker 151 days ago

Yes, I usually run Unsloth models; however, you are linking to the big model (355B-A32B), which I can't run on my consumer hardware.

The flash model in this thread is more than 10x smaller (30B).


a_e_k 151 days ago

When the Unsloth quant of the flash model does appear, it should show up as unsloth/... on this page:

https://huggingface.co/models?other=base_model:quantized:zai...

Probably as:

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF


homarp 151 days ago

It's a new architecture, not yet implemented in llama.cpp.

Issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931


dumbmrblah 151 days ago

One thing to consider is that this version is a new architecture, so it’ll take time for llama.cpp to get updated, similar to how it was with Qwen Next.


cristoperb 151 days ago

Apparently it is the same as the DeepseekV3 architecture, so llama.cpp will support it once the new name is added. Here's the PR: https://github.com/ggml-org/llama.cpp/pull/18936


khimaros 150 days ago

The PR has been merged.


latchkey 151 days ago

There are a bunch of 4-bit quants at the GGUF link, and 0xSero has some smaller stuff too. They might still be too big, in which case you'll need to un-GPU-poor yourself.


disiplus 151 days ago

Yeah, there is no way to run 4.7 on 32 GB of VRAM. This Flash model is something I'm also waiting to try later tonight.


omneity 151 days ago

Why not? Run it with the latest vLLM and enable 4-bit quantization with bnb (bitsandbytes); it will quantize the original safetensors on the fly and fit in your VRAM.


disiplus 151 days ago

Because of how huge GLM 4.7 is: https://huggingface.co/zai-org/GLM-4.7


omneity 151 days ago

Except this is GLM 4.7 Flash, which has 32B total params, 3B active. It should fit with a decent context window of 40k or so in 20 GB of RAM with 4-bit weight quantization, and you can save even more by quantizing the activations and KV cache to 8-bit.
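
For anyone who wants to check the arithmetic behind that estimate, here is a rough back-of-the-envelope sketch. It only counts raw weight storage and assumes decimal gigabytes; it ignores quantization metadata (scales, zero points), any layers left unquantized, and runtime overhead, so treat the numbers as a lower bound rather than a measurement. The helper name `weight_bytes` is made up for illustration.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes (assumption; GiB would give slightly smaller numbers)

total_params = 32e9  # "32B total params" from the comment above
budget_gb = 20       # memory budget from the comment above

w4 = weight_bytes(total_params, 4) / GB    # 4-bit weights
w16 = weight_bytes(total_params, 16) / GB  # 16-bit weights, for comparison
headroom = budget_gb - w4                  # what is left for KV cache, activations, overhead

print(f"4-bit weights:  {w4:.1f} GB")
print(f"16-bit weights: {w16:.1f} GB")
print(f"headroom in a {budget_gb} GB budget: {headroom:.1f} GB")
```

Under these assumptions the weights alone take about 16 GB at 4 bits, which leaves roughly 4 GB of the 20 GB budget for the KV cache and everything else. That is why an 8-bit KV cache (half the size of a 16-bit one) helps stretch the context window.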
