| HN Mirror



LuxBennu 78 days ago

Already running qwen 70b 4-bit on m2 max 96gb through llama.cpp and it's pretty solid for day to day stuff. The mlx switch is interesting because ollama was basically shelling out to llama.cpp on mac before, so native mlx should mean better memory handling on apple silicon. Curious to see how it compares on the bigger models vs the gguf path

3 comments

yg1112 77 days ago

The key difference is that MLX's array model assumes unified memory from the ground up. llama.cpp's Metal backend works fine but carries abstractions from the discrete GPU world — explicit buffer synchronization, command buffer boundaries — that are unnecessary when CPU and GPU share the same address space. You'll notice the gap most at large context lengths where KV cache pressure is highest.
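
For a rough sense of why KV cache pressure grows with context length, here is a back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumed 70B-class configuration with grouped-query attention, not something stated in this thread, and the function names are made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes needed to hold cached keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed (hypothetical) 70B-class shape; swap in the real model's values.
ASSUMED_SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)


def kv_cache_gib(context_len, **overrides):
    """KV cache size in GiB for the assumed shape, with optional overrides."""
    cfg = {**ASSUMED_SHAPE, **overrides}
    return kv_cache_bytes(context_len=context_len, **cfg) / 2**30


if __name__ == "__main__":
    for ctx in (2048, 8192, 32768):
        print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB")
```

Under these assumptions the cache grows linearly, at about 320 KiB per token, so it goes from well under 1 GiB at 2k tokens to around 10 GiB at 32k. That is the regime where any extra copying or synchronization of cache buffers would be most visible.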


LuxBennu 77 days ago

that tracks with what i've noticed practically. shorter prompts feel basically the same between llama.cpp metal and what i'd expect from native mlx, but once context gets longer the overhead starts showing up. would be interesting to see if ollama's mlx path actually handles kv cache differently under the hood or if it just skips the buffer sync layer


zozbot234 77 days ago

If it's just about skipping some buffer sync that's something that could also be adopted by llama.cpp's own Metal backend, at least on Apple Silicon platforms.


lioeters 77 days ago

Insightful comment, thanks!


goldenarm 78 days ago

How many tokens per second?


LuxBennu 77 days ago

Roughly 8-12 tokens/s on generation, depending on context length. Prompt processing is faster, obviously. Haven't benchmarked it carefully though, just eyeballing the llama.cpp output.
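
As a sanity check on that range, a minimal sketch of the usual bandwidth-bound estimate for generation speed: each new token has to stream roughly all the weights from memory once. The inputs are assumptions rather than measurements from this thread: 70e9 parameters, about 4.5 bits per weight for a 4-bit quant including scales, and 400 GB/s as the peak memory bandwidth Apple lists for the M2 Max.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def decode_tokens_per_sec_ceiling(n_params, bits_per_weight, bandwidth_bytes_per_sec):
    """Rough ceiling on generation speed if decoding is purely bandwidth-bound.

    Ignores KV cache reads, compute time, and any framework overhead.
    """
    return bandwidth_bytes_per_sec / weight_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    # All three inputs are assumed values, not measurements.
    ceiling = decode_tokens_per_sec_ceiling(70e9, 4.5, 400e9)
    print(f"~{ceiling:.1f} tokens/s ceiling")
```

With these inputs the ceiling comes out at about 10 tokens/s, which is in the same ballpark as the eyeballed figure. The exact value shifts with the real bits per weight of the quant, and sustained bandwidth is usually below the peak number.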


zozbot234 78 days ago

They initially messed up this launch and overwrote some of the GGUF models in their library, making them non-downloadable on platforms other than Apple Silicon. Hopefully that gets fixed.
