by cptskippy 3 hours ago

I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow, with pretty good success. One is a personal assistant that manages my backlog, keeps me on task, follows up with me on items, and puts together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 on the tasks I've given it. In fairness, though, the task it failed at left me scratching my head as well.

3 comments

hbbio 3 hours ago

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

askvictor 3 hours ago

Does Intel make decent GPUs now? I must be out of the loop...

speedgoose 2 hours ago

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

jauntywundrkind 3 hours ago

What's the value of running the smaller model too? Why not just use the big model for everything? I note both are dense, as well.

Ritewut 3 hours ago

Tokens per second. In practical usage, the difference between 8B and something like 16B is not as big as you might think, and 8B is a lot faster and more interactive than 16B. But there are certain things where it is useful to farm the work out to the large model.
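To make the "small by default, farm out the hard stuff" idea concrete, here is a rough sketch of a router. The model names are the two from the original post; the keyword list, the length threshold, and the `pick_model` helper itself are all made up for illustration and are not how the linked gateway works.

```python
import re
from dataclasses import dataclass

# Model names taken from the original post.
SMALL_MODEL = "qwen3-5-9b-q4-k-m"
LARGE_MODEL = "qwen3-6-27b-q6-k"

# Hypothetical hints that a request is worth the slower model.
HARD_TASK_WORDS = {"refactor", "design", "debug", "research"}


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def pick_model(prompt: str, interactive: bool = True,
               max_small_chars: int = 2000) -> Route:
    """Pick a model name for a prompt using simple, assumed heuristics."""
    # Match whole words so that e.g. "designate" does not trigger "design".
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    if not interactive:
        return Route(LARGE_MODEL, "background job, latency does not matter")
    if len(prompt) > max_small_chars:
        return Route(LARGE_MODEL, "long prompt")
    if words & HARD_TASK_WORDS:
        return Route(LARGE_MODEL, "hard-task keyword")
    return Route(SMALL_MODEL, "default: fast interactive answer")


if __name__ == "__main__":
    for p in ("rename this variable", "debug this race condition"):
        route = pick_model(p)
        print(f"{p!r} -> {route.model} ({route.reason})")
```

In practice you would probably tune the heuristics against your own tasks, or let the small model decide when to escalate.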

Natalia724 2 hours ago

Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.
