by elsombrero
267 days ago
On my 2x 3090s I'm running GLM-4.5 Air at Q1 (IQ1_M), and it gets ~300 tk/s prompt processing and 20-30 tk/s generation.
It works pretty well with Roo Code in VS Code: it rarely misses tool calls and produces decent quality code. I also tried it with Claude Code through Claude Code Router, and it's pretty fast.
Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better. This is my snippet for llama-swap:

```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
      -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
      --split-mode layer --tensor-split 0.48,0.52
      --flash-attn on
      -c 82000 --ubatch-size 512
      --cache-type-k q4_1 --cache-type-v q4_1
      -ngl 99 --threads -1
      --port ${PORT} --host 0.0.0.0
      --no-mmap
      -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K -ngld 99
      --kv-unified
```
|
I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?
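For what it's worth, here is the back-of-envelope arithmetic behind "might be doable", as a rough sketch only. It assumes the nominal 24 GB per 3090 and takes the ~42 GB figure above at face value. The helper name `vram_headroom_gb` is made up for illustration. Real usable VRAM is lower than nominal, and the KV cache plus compute buffers also have to fit in whatever is left.

```python
def vram_headroom_gb(model_gb: float, gpu_gb: float, num_gpus: int) -> float:
    """Return nominal VRAM left after loading the weights.

    A negative result means some weights would have to spill to CPU/RAM.
    This ignores KV cache, compute buffers, and driver overhead (assumption).
    """
    return gpu_gb * num_gpus - model_gb


# ~42 GB Q2 weights on two 24 GB cards (nominal sizes, not measured)
headroom = vram_headroom_gb(model_gb=42.0, gpu_gb=24.0, num_gpus=2)
print(f"nominal headroom: {headroom:.1f} GB")
```

So on paper the weights fit with about 6 GB to spare, and whether that is enough depends on how much context you need and how the KV cache is quantized.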