| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by imiric 311 days ago

I have a similar setup as the author with 2x 3090s.

The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.

The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.

3 comments

elsombrero 311 days ago

On my 2x 3090s I am running glm4.5 air q1 and it runs at ~300pp and 20/30 tk/s works pretty well with roo code on vscode, rarely misses tool calls and produces decent quality code.

I also tried to use it with claude code with claude code router and it's pretty fast. Roo code uses bigger contexts, so it's quite slower than claude code in general, but I like the workflow better.

this is my snippet for llama-swap

``` models: "glm45-air": healthCheckTimeout: 300 cmd: | llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M --split-mode layer --tensor-split 0.48,0.52 --flash-attn on -c 82000 --ubatch-size 512 --cache-type-k q4_1 --cache-type-v q4_1 -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K -ngld 99 --kv-unified ```

link

imiric 311 days ago

Thanks, but I find it hard to believe that a Q1 model would produce decent results.

I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?

link

elsombrero 311 days ago

well, I tried it and it works for me. llm output is hard to properly evaluate without actually using it.

I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.

As for using Q2, it will fit in vram, but with very small context or spill over to RAM, but with quite an impact on speed depending on your setup. I have slow ddr4 ram and going for Q1 has been a good compromise for me, but YMMV.

link

ericdotlee 311 days ago

What is llama-swap?

Been looking for more details about software configs on https://llamabuilds.ai

link

elsombrero 311 days ago

https://github.com/mostlygeek/llama-swap

it's a transparent proxy that automatically launches your selected model with your preferred inference server so that you don't need to manually start/stop the server when you want to switch model

so, let's say I have configured roo code to use qwen3 30ba3b as the orchestrator and glm4.5 air as coder, roo code would call the proxy server with model "qwen3" when using orchestrator mode and then kill llama.cpp with qwen3 and restart it with "glm4.5air"

link

ThatPlayer 311 days ago

> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running

link

imiric 310 days ago

I use Ollama which offloads to the CPU automatically IIRC. IME the performance drops dramatically when that happens, and it hogs the CPU making the system unresponsive for other tasks, so I try to avoid it.

link

ThatPlayer 310 days ago

I don't believe that's the same thing. That should be the generic offloading that ollama will do to any too big model, while this feature requires MoE models. https://github.com/ollama/ollama/issues/11772 is the feature request for similar on ollama.

One comment in that thread mentions getting almost 30tk/s from gpt-oss-120b on a 3090 with llama.cpp compared to 8tk/s with ollama.

This feature is limited to MoE models, but those seem to be gaining traction with gpt-oss, glm-4.5, and qwen3

link

imiric 310 days ago

Ah, I was not aware of that, thanks. I'll give it a try.

link

mycall 311 days ago

> 20-30 tk/s

or ~2.2M tk/day. This is how we should be thinking about it imho.

link

chpatrick 310 days ago

Is it? If you're the only user then you care about latency more than throughput.

link

mycall 310 days ago

Not if you have a queue of work that isn't a high priority, like edge compute to review changes in security cam footage or prepare my next day's tasks (calendar, commitments, needs, etc)

link