by cptskippy 3 hours ago
I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow, with pretty good success. One is a personal assistant that manages my backlog, keeps me on task, follows up with me on items, and puts together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 on the tasks I've given it. In fairness though, the task it failed at left me scratching my head as well.

3 comments

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

Does Intel make decent GPUs now? I must be out of the loop...
They released a few good-value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great on tokens/watt.

I am not sure whether you can find those in stock anywhere.

What's the value running the smaller model too? Why not just the big model for everything? I note both are dense, as well.
Tokens per second. The quality difference between 8B and something like 16B is not as big as you might think in practical usage, and 8B is a lot faster and more interactive than 16B. But there are certain things where it is useful to farm the work out to the large model.
Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.
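
To make the "small model by default, farm out the hard stuff" idea concrete, here is a minimal sketch of a router. It is not how the linked gateway works; the keyword list, the length threshold, and the `pick_model` helper are all assumptions made up for illustration, and the model labels are just the names mentioned at the top of the thread.

```python
from dataclasses import dataclass

# Labels copied from the thread; how they map to real endpoints is an assumption.
SMALL_MODEL = "qwen3-5-9b-q4-k-m"
LARGE_MODEL = "qwen3-6-27b-q6-k"

# Hypothetical keywords hinting that a request deserves the larger model.
HARD_HINTS = ("refactor", "architecture", "debug", "prove", "research")


@dataclass
class Route:
    model: str
    reason: str


def pick_model(prompt: str, max_small_chars: int = 2000) -> Route:
    """Keep short, simple prompts on the fast model; send the rest to the large one."""
    lowered = prompt.lower()
    for hint in HARD_HINTS:
        if hint in lowered:
            return Route(LARGE_MODEL, f"keyword: {hint}")
    if len(prompt) > max_small_chars:
        return Route(LARGE_MODEL, "long prompt")
    return Route(SMALL_MODEL, "default fast path")


if __name__ == "__main__":
    for p in ("rename this variable", "debug this deadlock in the worker pool"):
        r = pick_model(p)
        print(f"{r.model}: {p!r} ({r.reason})")
```

A keyword heuristic like this is crude, but it keeps the interactive path on the fast model and only pays the latency cost of the big one when the request looks like it needs it.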