| HN Mirror

by parched99 421 days ago

I can only get Gemma-3-27b-it-qat-Q4_0.gguf (15.6 GB) to run with a 100-token context on a 5070 Ti (16 GB) using llama.cpp.

Prompt: 10 tokens, 229.089 ms, 43.7 t/s

Generation: 41 tokens, 959.412 ms, 42.7 t/s
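For anyone checking the figures above, the speed is just the token count divided by the elapsed time in seconds. A minimal sketch; the function name is made up for illustration and is not part of llama.cpp:

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and a time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


# Figures reported in the post above.
prompt_speed = tokens_per_second(10, 229.089)
generation_speed = tokens_per_second(41, 959.412)

print(f"prompt: {prompt_speed:.1f} t/s, generation: {generation_speed:.1f} t/s")
```

Rounded to one decimal, both values agree with the speeds in the post.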

3 comments

tbocek 421 days ago

This is probably due to https://github.com/ggml-org/llama.cpp/issues/12637. That GitHub issue is about interleaved sliding window attention (iSWA) not being available in llama.cpp for Gemma 3. Implementing it could reduce the memory requirements a lot. They mention one scenario going from 62 GB to 10 GB.
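A rough back-of-the-envelope sketch of why iSWA matters for the KV cache. The model shape below (62 layers, 16 KV heads, head dimension 128, a 1024-token sliding window, one global layer in every six) is my recollection of the published Gemma 3 27B config, and the 128K-token fp16 scenario is an assumption, so treat all of the constants as assumptions rather than values taken from the issue:

```python
GIB = 2**30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """KV cache size when every layer caches all n_ctx tokens.

    The leading 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def iswa_kv_cache_bytes(n_layers, pattern, window, n_kv_heads, head_dim,
                        n_ctx, bytes_per_elem=2):
    """KV cache size when only every `pattern`-th layer is global.

    Sliding-window (local) layers only need to cache the last `window` tokens.
    """
    n_global = n_layers // pattern
    n_local = n_layers - n_global
    global_bytes = kv_cache_bytes(n_global, n_kv_heads, head_dim, n_ctx,
                                  bytes_per_elem)
    local_bytes = kv_cache_bytes(n_local, n_kv_heads, head_dim,
                                 min(window, n_ctx), bytes_per_elem)
    return global_bytes + local_bytes


# Assumed Gemma 3 27B shape and an assumed 128K-token fp16 cache.
full = kv_cache_bytes(62, 16, 128, 131072)
iswa = iswa_kv_cache_bytes(62, 6, 1024, 16, 128, 131072)

print(f"full attention: {full / GIB:.1f} GiB, iSWA: {iswa / GIB:.1f} GiB")
```

Under those assumptions the full-attention cache works out to 62 GiB and the iSWA cache to roughly 10.4 GiB, which is in line with the 62 GB to 10 GB figure mentioned in the issue.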

parched99 421 days ago

Resolving that issue would reduce (not eliminate) the memory the context needs. The model itself will still only just barely fit in 16 GB, which is what the parent comment asked about.

It's best to have two or more low-end 16 GB GPUs, for a total of 32 GB of VRAM, to run most of the better local models.

nolist_policy 421 days ago

Ollama supports iSWA.

idonotknowwhy 421 days ago

I didn't realise the 5070 is slower than the 3090. Thanks.

If you want a bit more context, try -ctk q8_0 -ctv q8_0 (from memory, so look it up) to quantize the KV cache.

Also, an imatrix GGUF like IQ4_XS might be smaller with better quality.
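To put a number on the KV cache suggestion: a q8_0 block stores 32 int8 values plus one fp16 scale in 34 bytes, so it costs about 8.5 bits per element instead of 16 for f16. A small sketch of what that buys per token, assuming a Gemma 3 27B shape of 62 layers, 16 KV heads and head dimension 128 (my recollection of the config, not something stated in this thread) and a purely hypothetical 256 MiB of VRAM left over for the cache:

```python
# Average bytes per cached element for each cache type.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type):
    """Bytes of K plus V stored per token when every layer caches the token."""
    return 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]


def tokens_that_fit(budget_bytes, n_layers, n_kv_heads, head_dim, cache_type):
    """How many tokens of KV cache fit into a given byte budget."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)
    return int(budget_bytes // per_token)


BUDGET = 256 * 2**20  # hypothetical spare VRAM, for illustration only

for cache_type in ("f16", "q8_0"):
    n = tokens_that_fit(BUDGET, 62, 16, 128, cache_type)
    print(f"{cache_type}: about {n} tokens")
```

This ignores compute buffers and any sliding-window savings, so it only shows the ratio: q8_0 roughly doubles the context that fits in the same space.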

parched99 421 days ago

I answered the question directly. IQ4_XS is smaller, but slower and less accurate than Q4_0. The parent comment specifically asked about the QAT version. That's literally what this thread is about. I mentioned the context length to show that it's only barely usable.

floridianfisher 421 days ago

Try one of the smaller versions. 27B is too big for your GPU.

parched99 421 days ago

I'm aware. I was addressing the question being asked.
