| HN Mirror


by ggerganov 2 days ago

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs.

Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style.

About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

6 comments

trilogic 1 day ago

I can also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, a simple agents.md and some utility skills). These local models come with tools, which is mind-blowing considering that a few months ago we had to define tools in .md files ourselves. What makes a model worth even more is "Memory". We implemented that long ago. The last time I used proprietary services was 3 months ago; I don't really need them, and my subscription is going unused.

Gerganov, I hope you will consider developing the CLI further, because we are suffering with the server.


jayGlow 1 day ago

what are you using for memory with your local models? is there a specific harness you would recommend for local agents?


mft_ 1 day ago

I’m using Hermes at the moment - it comes with lots of tools already baked in for the agent to use - for example web and browser access just worked, rather than having to mess around loads with config scripts and plugins.

I’ve also tried OpenCode (similar but a bit less so) and Pi (fast but you have to add lots of features yourself which is a bit of a pain). Claude Code can also be pointed at a local model and works, but the default system prompt is huge. (~140k of text when I extracted mine, IIRC.)


trilogic 1 day ago

I use HugstonOne (its backend is a personalized version of llama.cpp). It implements its own double-layer memory: one layer recalls the full or partial previous session/file with an ON/OFF switch (which picks up where it left off in the CLI, the server, or both at the same time), and another reads back a % of the current tab if the memory switch is off, doing checkpoints every certain number of tokens, summarizing and referring back to them when needed (recalled by certain logic). There is more to it when involving local RAG (making it a triple memory layer), but that's a long story.

About the harness: it depends on what you need it for, but as a universal unit of measure, a harness is multilayered and depends on logic and domain. I would definitely include type of hardware, model parameters/knowledge, model intelligence, model size/context, type of conversion, and quantization type (models come with some default tools), plus your own domain-specific skills, tools, memory, logs, security, RAG, online search... (which, as scary as they sound, are mostly simple logic in a txt file, like "if this, do that").

The full pack is Harness 10; every missing piece lowers the harness score.

To answer your question, I would definitely recommend something like HugstonOne (or at least the llama.cpp CLI) with a Qwen 3.6 35B finetune/distill (deepseek 4 or claude 4.7), and none of the current coding agents out there that are screaming for an internet connection, a proprietary API and data collection. Do this: if you can find a tool that you can download, point at a local model of your choice (in whatever local folder), and load ready for inference without any need for an internet connection, that is the tool you should aim for. Right now there is none out there.


kpw94 1 day ago

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

The reason I'm asking about prefill is that, looking at my stats at work, I use between 20M and peaks of 300M input tokens daily. Some of those tokens are cached, but in general I seem to have roughly 500x more input tokens than output. So I'm interested in prefill tok/s stats.
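For a sense of scale, here is a rough sketch of what that input volume means at 150 tok/s prefill. It assumes a single sequential stream and no prompt caching, so it overstates the real cost; the function name is made up for illustration.

```python
def prefill_hours(input_tokens: float, prefill_tps: float) -> float:
    """Hours needed to process `input_tokens` at a constant prefill rate."""
    return input_tokens / prefill_tps / 3600


# Daily input volumes quoted above, at the iGPU prefill speed quoted above.
for daily_tokens in (20e6, 300e6):
    hours = prefill_hours(daily_tokens, 150)
    print(f"{daily_tokens / 1e6:.0f}M tokens at 150 tok/s: {hours:.0f} h")
```

Under those assumptions even the low end of the range (20M tokens) works out to about 37 hours of prefill per day, which is why the prefill rate matters more than the generation rate for this kind of workload.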

Huge thank you for llama.cpp btw!!


ggerganov 1 day ago

Here are the prefill speeds:

    Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
  | model                          |       size |     params | backend  |  fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |   pp2048 @ d512 |      3714.02 ± 10.85 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d1024 |      3684.86 ± 15.21 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d2048 |       3650.80 ± 8.53 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d8192 |       3473.88 ± 0.97 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 | pp2048 @ d32768 |       2754.69 ± 4.07 |

  ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Ultra)
  | model                          |       size |     params | backend  | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |   pp2048 @ d512 |        379.75 ± 0.21 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d1024 |        377.15 ± 0.35 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d2048 |        371.46 ± 0.91 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d8192 |        344.84 ± 0.41 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 | pp2048 @ d32768 |        222.42 ± 5.29 |
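
A small sketch to summarize the two tables above: how much prefill throughput drops between the shallowest and the deepest context depth on each machine. The numbers are copied from the tables; the helper name is hypothetical. Note that the two runs use different quantizations (Q4_K Medium vs Q8_0), so this is not a like-for-like hardware comparison.

```python
# Prefill throughput (t/s) copied from the llama-bench tables above,
# keyed by context depth in tokens (the "d" value in the test column).
rtx_5090_q4 = {512: 3714.02, 1024: 3684.86, 2048: 3650.80, 8192: 3473.88, 32768: 2754.69}
m2_ultra_q8 = {512: 379.75, 1024: 377.15, 2048: 371.46, 8192: 344.84, 32768: 222.42}


def slowdown_pct(results: dict[int, float]) -> float:
    """Percent drop in throughput from the shallowest to the deepest depth."""
    shallow = results[min(results)]
    deep = results[max(results)]
    return (1 - deep / shallow) * 100


print(f"RTX 5090 (Q4_K Medium): {slowdown_pct(rtx_5090_q4):.1f}% slower at d32768 than at d512")
print(f"M2 Ultra (Q8_0):        {slowdown_pct(m2_ultra_q8):.1f}% slower at d32768 than at d512")
```

In these runs the RTX 5090 loses roughly a quarter of its prefill speed at 32k depth, while the M2 Ultra loses roughly 40%.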

Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.

Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.

[0] https://github.com/ggml-org/llama.cpp/pull/19164
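
For readers unfamiliar with the idea, the sketch below illustrates n-gram lookup drafting in general terms: find an earlier occurrence of the most recent n tokens in the context and propose whatever followed it as draft tokens, which the model then verifies in one batch. It is a toy illustration under my own assumptions (function name, parameters, most-recent-match policy), not the implementation in the PR linked above.

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by locating the most recent earlier occurrence of
    the last `n` tokens and copying up to `max_draft` tokens that followed it.

    Returns an empty list when there is no earlier match.
    """
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


# The context ends with [1, 2, 3], which also appeared at the very start,
# so the tokens that followed that first occurrence become the draft.
history = [1, 2, 3, 4, 5, 1, 2, 3]
print(ngram_draft(history, n=3))
```

A scheme like this tends to pay off when the output repeats text that is already in context, such as code being edited in place, which would plausibly fit the iterate-on-an-implementation workflow described above.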


kpw94 1 day ago

Thanks! Super helpful.

I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)

At work I use it in a similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineers you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.

It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.


girvo 1 day ago

> Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style

This really is the secret to getting the most out of these models IMO. Pi is so damned good. I have a strongly tuned Pi for running Step 3.7 Flash (IQ4_XS) and Qwen 3.6 27B (FP8)

Also, thank you for llama.cpp mate :)


androiddrew 1 day ago

I have never heard of Step 3.7 Flash. Why do you like it? What rough spots have you encountered?


celrod 1 day ago

What quant do you run it at? 32GB seems like cutting it close on the RTX 5090 if going 8-bit, but other commenters are saying 4-bit lobotomizes the model.


ggerganov 1 day ago

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the Mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
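
A rough way to see why, using only numbers from the llama-bench output earlier in the thread: subtract the model size from the reported VRAM to get what is left for the KV cache and compute buffers. This assumes the Q8_0 file size measured on the Mac applies unchanged on the CUDA machine, and it ignores any other VRAM consumers.

```python
MIB_PER_GIB = 1024

# RTX 5090 VRAM as reported in the llama-bench header (32109 MiB).
vram_gib = 32109 / MIB_PER_GIB

# Model sizes in GiB, taken from the "size" column of the two tables.
weights_gib = {"Q8_0": 26.62, "Q4_K - Medium": 15.92}


def headroom_gib(quant: str) -> float:
    """VRAM left after loading the weights, before KV cache and buffers."""
    return vram_gib - weights_gib[quant]


for quant in weights_gib:
    print(f"{quant}: {headroom_gib(quant):.2f} GiB left for context")
```

That leaves about 4.7 GiB at Q8_0 versus about 15.4 GiB at Q4_K Medium, which is consistent with the Q8 weights fitting on the card but not together with the full context.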


girvo 1 day ago

27B seems surprisingly resilient to quantisation. Though my evals showed there was some impact on coding ability going from 8 bit to 4 bit, it was less than I would've expected, and it was on task types that you've said above that you don't really do with these!


toddmorey 1 day ago

For the curious, it looks like a PC with an RTX 5090 32GB graphics card will run you about $6,000.


fridder 1 day ago

Not too shabby. I like the regular Qwen, but prompt prefill on my M1 Max is slow as hell.
