Built this to run coding agents locally on Apple Silicon. The two problems I kept hitting: most models fail at structured tool calling, and existing servers are slow on MLX.
Two findings from benchmarking 7 models across 5 agent frameworks:
1. The Qwen family gets a 100% tool-calling success rate across every framework tested. Non-Qwen models (Llama, DeepSeek-R1) vary wildly, from 40% to 100% depending on the framework.
2. smolagents (HuggingFace) sidesteps structured function calling entirely: the model writes code that calls the tools instead of emitting a structured call. DeepSeek-R1 goes from 40% with structured function calling to 100% with smolagents.
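To make the difference in point 2 concrete, here is a minimal sketch of the two styles. Everything in it is hypothetical and for illustration only: the `get_weather` tool, the `TOOLS` registry, and both runner functions are made up, and none of it is the smolagents API or this project's code. It only shows why a model that adds a few stray words breaks a strict JSON call but can still succeed when it writes code.

```python
import json

# Hypothetical tool, made up for this example.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Hypothetical registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}

def run_structured_call(model_output: str) -> str:
    """Structured function calling: the whole output must be valid JSON
    with a 'name' and an 'arguments' object."""
    call = json.loads(model_output)  # raises json.JSONDecodeError on stray text
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

def run_code_call(model_output: str) -> str:
    """Code-generation style: the model writes Python that calls the tool
    and stores the answer in a variable named 'result' (an assumed convention)."""
    # Only the tools are exposed. This is NOT a real sandbox; a real agent
    # framework needs a proper restricted interpreter.
    namespace = {"__builtins__": {}, **TOOLS}
    exec(model_output, namespace)
    return namespace["result"]

if __name__ == "__main__":
    good_json = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    chatty_json = "Sure! " + good_json  # a common failure mode
    code = '# look up the weather\nresult = get_weather(city="Paris")'

    print(run_structured_call(good_json))
    try:
        run_structured_call(chatty_json)
    except json.JSONDecodeError:
        print("structured call failed: output was not pure JSON")
    print(run_code_call(code))
```

The code path tolerates comments and extra statements because the interpreter does the parsing, which is one plausible reason a reasoning-heavy model does better with it.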
Speed-wise, MLX runs on Apple Silicon's unified memory, so there are no CPU↔GPU copies. On an M3 Ultra, Qwen3.5-9B hits 108 tok/s (vs ~41 on Ollama), and Qwen 3.6 35B does 100 tok/s with only 3B active params.
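For anyone who wants to compare numbers on their own machine, here is a rough sketch of one common way to compute a tok/s figure and a speedup ratio. It is an assumption about methodology, not a description of how the figures above were measured. The `tokens_per_second` and `speedup` helpers are hypothetical names, and `generate` stands in for whatever streaming call your server exposes.

```python
import time
from typing import Callable, Iterable

def tokens_per_second(generate: Callable[[], Iterable[str]]) -> float:
    """Time a token stream and return tokens per wall-clock second.
    'generate' is a hypothetical zero-argument callable yielding tokens."""
    start = time.perf_counter()
    count = sum(1 for _ in generate())
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very short streams.
    return count / elapsed if elapsed > 0 else float("inf")

def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio between two throughput figures."""
    return fast_tps / slow_tps

if __name__ == "__main__":
    # Using the figures quoted above: 108 tok/s vs ~41 tok/s.
    print(f"{speedup(108, 41):.2f}x")
```

With the quoted numbers the ratio works out to roughly 2.6x. Note that this sketch times the whole stream, so prompt processing and decode are lumped together unless you start the timer at the first token.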
The full benchmark data is in the README. Happy to discuss the MLX performance characteristics or tool calling architecture.
Would definitely love benchmarks against omlx and fast-mlx one day (I also have a 256GB M3 Ultra).