Show HN: Rapid-MLX – Run local LLMs on Mac, 2-3x faster than alternatives

	Show HN: Rapid-MLX – Run local LLMs on Mac, 2-3x faster than alternatives (github.com)
	9 points by raullen 62 days ago

2 comments

c0rruptbytes 49 days ago

i use this, big fan - i have unlimited codex tokens if you ever need some dev assistance

Would definitely love benchmarks against omlx and fast-mlx one day (i also have a 256gb m3 ultra)

taylorhou 45 days ago

is this available for other open source projects? i'm stealing tokens from my employer effectively and keep hitting my limits re: codex tokens o_O

i'm working on github.com/teale-ai (distributed inference)

Johnny_Bonk 44 days ago

This looks somewhat interesting. Is the premise that if you have a strong Mac, you can rent out your hardware for someone to run inference on it?

raullen 62 days ago

Built this to run coding agents locally on Apple Silicon. The main problem I kept hitting: most models fail at structured tool calling, and existing servers are slow on MLX.

Two findings from benchmarking 7 models across 5 agent frameworks:

1. The Qwen family gets 100% tool-calling success across every framework tested. Non-Qwen models (Llama, DeepSeek-R1) vary wildly: 40% to 100% depending on framework.

2. smolagents (HuggingFace) sidesteps structured function calling entirely by using code generation. DeepSeek-R1 goes from 40% with structured FC to 100% with smolagents.
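To make the difference between the two findings concrete, here is a minimal sketch of the two calling styles. It is not the benchmark harness from the README and not how smolagents is implemented internally; `get_weather`, `TOOLS`, and the three helper functions are hypothetical names made up for illustration.

```python
import json

# Hypothetical tool registry; the name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"sunny in {city}"

TOOLS = {"get_weather": get_weather}


def run_structured_call(model_output: str):
    """Structured function calling: the model must emit exact JSON."""
    try:
        call = json.loads(model_output)
        if not isinstance(call, dict):
            return None
        func = TOOLS[call["name"]]
        return func(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed JSON, unknown tool, or bad arguments


def run_code_action(model_output: str):
    """Code-generation style: the model writes Python that calls the tool.

    Bare exec() keeps the sketch short; never run untrusted model output
    this way outside a sandbox.
    """
    namespace = dict(TOOLS)
    try:
        exec(model_output, namespace)
    except Exception:
        return None
    return namespace.get("result")


def success_rate(outputs, runner):
    """Fraction of outputs that produce a non-None tool result."""
    results = [runner(o) for o in outputs]
    return sum(r is not None for r in results) / len(results)
```

A model that wraps its JSON in chatty prose fails `run_structured_call` outright, while the same model may still write a valid one-line `result = get_weather("Paris")` for `run_code_action`. That is one plausible reading of why the code-generation route scores higher for some models.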

Speed-wise, MLX runs on Apple Silicon's unified memory, so there are no CPU↔GPU copies. On an M3 Ultra, Qwen3.5-9B hits 108 tok/s (vs ~41 on Ollama), and Qwen 3.6 35B does 100 tok/s with only 3B active params.
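As a quick sanity check on the 9B figures quoted above, the arithmetic can be sketched like this. The helper names are hypothetical, and the Ollama number is the approximate ~41 tok/s from the post, so the ratio is approximate too.

```python
def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """Ratio of two decode throughputs, both in tokens per second."""
    return fast_tok_s / slow_tok_s


def seconds_for(tokens: int, tok_s: float) -> float:
    """Time to generate `tokens` tokens at a steady `tok_s` rate."""
    return tokens / tok_s


ratio = speedup(108, 41)            # about 2.63x
fast_time = seconds_for(1000, 108)  # about 9.3 s for a 1000-token reply
slow_time = seconds_for(1000, 41)   # about 24.4 s for the same reply
print(f"{ratio:.2f}x faster, {fast_time:.1f}s vs {slow_time:.1f}s")
```

The roughly 2.6x ratio sits inside the "2-3x" range from the title. This covers steady-state generation speed only and says nothing about prompt processing time.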

The full benchmark data is in the README. Happy to discuss the MLX performance characteristics or tool calling architecture.
