| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by d4rkp4ttern 70 days ago

You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However the tau2 bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don’t expect the former to do well on heavy agentic tool-heavy tasks:

[1] https://news.ycombinator.com/item?id=47616761

1 comments

peder 69 days ago

Did you have any Anthropic vs OpenAI specification issues with Claude Code? I have been using mlx_vlm and vMLX and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server ?

link

d4rkp4ttern 69 days ago

Correct, no issues because since at least a few months, llama.cpp/server exposes an Anthropic messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.

link

selectodude 69 days ago

I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.

link

d4rkp4ttern 69 days ago

At least for the Gemma4-26B-A4B, Token-gen speed with OMLX is far worse on my M1 Max 64GB Macbook, compared to llama-server:

  Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
  was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
  message):

  llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
  --top-p 0.95 --top-k 64):
  - pp ≈ 395 tok/s
  - tg ≈ 40 tok/s

  oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
  sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
  - pp ≈ 350 tok/s
  - tg ≈ 5–13 tok/s

  Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x
  slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.

link

unstatusthequo 66 days ago

Check out vMLX if you use Apple Silicon. https://github.com/jjang-ai/mlxstudio

link

vlowther 69 days ago

Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max w 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.

link

peder 69 days ago

Have you been using `omlx serve`? If so, how are you bumping up the max context size? I'm not seeing a param to go above 32k?

link

d4rkp4ttern 69 days ago

you can set it in the .omlx/settings.json - ask a code-agent to figure it out by pointing it at the omlx repo

link