| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cgearhart 175 days ago
	Any notes on the problems with MLX caching? I’ve experimented with local models on my MacBook and there’s usually a good speedup from MLX, but I wasn’t aware there’s an issue with prompt caching. Is it from MLX itself or LMstudio/mlx-lm/etc?

2 comments

anon373839 175 days ago

There’s this issue/outstanding PR: https://github.com/lmstudio-ai/mlx-engine/pull/188#issuecomm...

link

dust42 175 days ago

It is the buffer implementation. [u1 10kTok]->[a1]->[u2]->[a2]. If you branch between the assistant1 and user2 answers then MLX does reprocess the u1 prompt of let's say 10k tokens while llama.cpp does not.

I just tested with GGUF and MLX of Qwen3-Coder-Next with llama.cpp and now with LMStudio. As I do branching very often, it is highly annoying for me to the point of being unusable. Q3-30B is much more usable then on Mac - but by far not as powerful.

link