by LuxBennu 75 days ago
that tracks with what i've noticed in practice. shorter prompts feel basically the same between llama.cpp's metal backend and what i'd expect from native mlx, but once the context gets longer the overhead starts showing up. would be interesting to see whether ollama's mlx path actually handles the kv cache differently under the hood or just skips the buffer sync layer

If it's just about skipping some buffer sync, that's something llama.cpp's own Metal backend could also adopt, at least on Apple Silicon platforms.