by rsolva 28 days ago
In our company of 24 employees, we get by with two DGX Sparks. We don't use AI heavily, but each Spark can serve about 6-8 concurrent requests with a full context length of 256k, which is decent. We get about 35 t/s depending on the model we use (currently Qwen3.5 122B A10B and Qwen3 Coder Next), but we might set up a smaller model too for simpler tasks.

This works for us and will work for years to come. It is not SOTA, but it works darn well for our purposes, and we control the compute and data flowing through it, so totally worth it.

2 comments

That's pretty nice, actually. How much KV cache does that model require at full context? That tends to be the main limit on running concurrent requests locally. There's KV quantization, but it has an outsized negative impact on model quality.
I have experimented with both q8 and q4 for KV cache. I can't find any difference between q8 and fp16, but q4 suffers more when the context grows. q8 seems like a good compromise and gives us enough ctx for about 6-8 concurrent, full context sessions. But we have not fully tested those limits yet, as the context windows rarely reach the limit.
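For a rough sense of scale, KV cache size grows linearly with context length and with the width of each stored element. Below is a minimal sketch of that arithmetic. The layer count, KV head count and head dimension are hypothetical placeholders, not the real configuration of any model mentioned in this thread, and the bytes-per-element values for q8 and q4 are approximations that ignore the small per-block scale overhead real quantization formats add.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element):
    """Approximate KV cache size for one session.

    n_layers should count only the layers that keep a full-length KV cache.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Hypothetical architecture numbers, chosen only for illustration.
N_LAYERS = 12
N_KV_HEADS = 2
HEAD_DIM = 256
CONTEXT = 256 * 1024  # "256k" tokens

# Approximate storage width per cached element (assumption: no block overhead).
BYTES_PER_ELEMENT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

GIB = 1024 ** 3

if __name__ == "__main__":
    for name, width in BYTES_PER_ELEMENT.items():
        size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, width)
        print(f"{name}: {size / GIB:.1f} GiB per full-context session")
```

With these made-up numbers, one full-context session comes to 6 GiB at fp16, 3 GiB at q8 and 1.5 GiB at q4, which shows why halving the element width roughly doubles the number of concurrent full-context sessions that fit in the same memory.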
This is pretty cool. How would you say that these open models compare to SOTA on coding tasks? I pay $200/mo for Claude Max but honestly this sounds way more fun.
Nowadays I use our local setup 95% of the time, but it has not been that long since that flipped for me personally.

Context: I have a $20 Claude Code subscription, and have used it for a handful of small-ish projects over the last year, in parallel with local models on my AMD 9700XTX (24GB) at home. Mostly Ministral 14B and more recently Qwen3.6 27B Dense 4q.

Historically, the tooling (inference engines and harness) has been the biggest challenge when using local models; a lot of the benefit of Claude Code was a rather unified and well-oiled agent system. Local setups often bring with them subtle incompatibilities between models, inference engines and agent systems that are not obvious from initial testing, but cause trouble on projects larger than a couple of files.
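To illustrate the kind of mismatch meant here, a minimal sketch of a check that a model's raw tool call is something a harness could actually use. Everything in it is hypothetical: the `web_search` tool, its fields and the `check_tool_call` helper are made up for illustration and are not part of any engine or agent named in this thread. The one real-world detail it leans on is that OpenAI-style APIs carry tool arguments as a JSON-encoded string, while other stacks may pass an object.

```python
import json

# Hypothetical tool description; the name and fields are made up for illustration.
TOOL = {
    "name": "web_search",
    "required": {"query": str},
    "optional": {"max_results": int},
}


def check_tool_call(raw, tool):
    """Return a list of problems in a raw tool call; an empty list means it looks usable."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call is not a JSON object"]

    problems = []
    if call.get("name") != tool["name"]:
        problems.append(f"unexpected tool name: {call.get('name')!r}")

    args = call.get("arguments")
    if isinstance(args, str):
        # Arguments may arrive as a JSON-encoded string rather than an object.
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return problems + ["arguments string is not valid JSON"]
    if not isinstance(args, dict):
        return problems + ["arguments is not a JSON object"]

    known = {**tool["required"], **tool["optional"]}
    for key in tool["required"]:
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in known:
            problems.append(f"unknown argument: {key}")
        elif not isinstance(value, known[key]):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems
```

Running a handful of recorded model outputs through a check like this for each model and engine combination is one way to catch malformed calls before they surface halfway through a larger project.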

The Spark setup at work is now at a point where I do not miss Claude, like at all. A big part of this is the harness and the tools available to the agent, most critically a good tool for searching online. I use my Kagi subscription to allow the models to fetch up-to-date information, and the Kagi MCP I use also has a summarizer which is very helpful in avoiding rapidly filling up the context window.

I mostly use Zed and its native agent, which only recently got muuuch better, and on the terminal I use Pi with a minimal selection of extensions (currently pi-kagi-search, pi-smart-fetch, pi-btw and pi-diffloop). I also have Pi in Zed via the ACP, but it does not work so well with some of the extensions; the lack of a built-in permission system is especially a problem when YOLO mode is the only mode :)

Honestly, as long as you have a model that is decent at tool calling, you're good. Having a solid and stable frame around your model makes a huge difference. The only caveat in all of this is that I spend most of my time on smaller projects and debugging on Linux-based systems, not huge and complex code bases, so your mileage might vary.

The next phase at work is to set up a ChatGPT-like web interface, and so far LibreChat is at the top of my shortlist. We had OpenWebUI for a while, but it is so bad at using MCP tools that it is practically non-functional for us. LibreChat is a bit more work to set up, but the interface and its MCP story are much more solid. The goal is to plug our internal helpdesk, docs and task manager system into LibreChat via MCPs to give us a quick way to query and gather information that is currently very time-consuming to collect by hand.