Did you have any Anthropic vs OpenAI specification issues with Claude Code? I have been using mlx_vlm and vMLX and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server ?
Correct, no issues because since at least a few months, llama.cpp/server exposes an Anthropic messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.
At least for the Gemma4-26B-A4B, Token-gen speed with OMLX is far worse on my M1 Max 64GB Macbook, compared to llama-server:
Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
message):
llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
--top-p 0.95 --top-k 64):
- pp ≈ 395 tok/s
- tg ≈ 40 tok/s
oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
- pp ≈ 350 tok/s
- tg ≈ 5–13 tok/s
Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x
slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.
Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max w 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.