by prvc 421 days ago
> ~15GB (MLX) leaving plenty of memory for running other apps.

Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.
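
For a rough sense of the numbers, here is a back-of-the-envelope sketch in Python. It assumes the quoted "~15GB" is decimal gigabytes and the machine has exactly 16 GiB; both readings are assumptions, and the helper name `headroom_gib` is made up for illustration. It ignores the KV cache, the OS, and everything else that also needs memory.

```python
# Rough headroom estimate: how much RAM is left once the model weights are loaded.
# Assumption: "~15GB" means 15 * 10**9 bytes and the machine has 16 * 2**30 bytes.

GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte


def headroom_gib(model_bytes: int, ram_bytes: int) -> float:
    """Return the memory left over after loading the model, in GiB."""
    return (ram_bytes - model_bytes) / GIB


# Reading 1: the model size is 15 decimal GB.
left_if_gb = headroom_gib(15 * GB, 16 * GIB)

# Reading 2: the model size is really 15 GiB.
left_if_gib = headroom_gib(15 * GIB, 16 * GIB)

print(f"15 GB model on 16 GiB:  {left_if_gb:.2f} GiB left")
print(f"15 GiB model on 16 GiB: {left_if_gib:.2f} GiB left")
```

Either way that leaves only about 1 to 2 GiB for the OS, the context cache, and every other app, which is why I'd expect thrashing.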

Any idea why MLX and ollama use such different amounts of RAM?

I don't think ollama quantizes the embedding table; it is still stored in full FP16.
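
To see how much an unquantized embedding table could matter, here is a rough sketch. The vocabulary size (262,144) and hidden size (5,376) are my assumptions for Gemma 3 27B and should be checked against the model config; the 4.5 bits per weight figure is the GGUF Q4_0 block layout (32 4-bit weights plus one FP16 scale per block). The function name `table_gib` is hypothetical.

```python
# Estimate the size of a token embedding table at different precisions.
# ASSUMPTIONS (not from the thread): vocab and hidden sizes below are what I
# believe Gemma 3 27B uses; verify against the model's config before relying on them.

GIB = 2**30

VOCAB_SIZE = 262_144   # assumed vocabulary size
HIDDEN_SIZE = 5_376    # assumed embedding width

FP16_BITS = 16.0
# Q4_0 stores blocks of 32 weights as 16 bytes of 4-bit values plus a 2-byte
# FP16 scale: 18 bytes per 32 weights = 4.5 bits per weight.
Q4_0_BITS = 18 * 8 / 32


def table_gib(vocab: int, hidden: int, bits_per_weight: float) -> float:
    """Size in GiB of a vocab x hidden table at the given bits per weight."""
    return vocab * hidden * bits_per_weight / 8 / GIB


fp16_gib = table_gib(VOCAB_SIZE, HIDDEN_SIZE, FP16_BITS)
q4_gib = table_gib(VOCAB_SIZE, HIDDEN_SIZE, Q4_0_BITS)

print(f"FP16 table: {fp16_gib:.2f} GiB")
print(f"Q4_0 table: {q4_gib:.2f} GiB")
print(f"Difference: {fp16_gib - q4_gib:.2f} GiB")
```

Under those assumptions, keeping the table in FP16 costs a bit under 2 GiB extra, which is noticeable on a 16 GiB machine but may not account for the whole gap.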

If you're using MLX, that means you're on a Mac, in which case ollama actually isn't your best option. Either use llama.cpp directly if you're a power user, or use LM Studio if you want something a bit better than ollama but more user-friendly than llama.cpp. (LM Studio has a GUI and is also more user-friendly than ollama, but has the downside of not being as scriptable. You win some, you lose some.)

Don't use MLX; it's not currently as fast or as small as the best GGUFs (and it also tends to be buggier, with some known bugs with Japanese at the moment). Download the LM Studio version of the Gemma 3 QAT GGUF quants, which are made by Bartowski. Google actually mentions Bartowski directly in the blog post linked above (ctrl-f his name), and his models are currently the best ones to use.

https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-G...

The "best Gemma 3 27b model to download" crown has taken a very roundabout path. After the initial Google release, it went from Unsloth Q4_K_M, to Google QAT Q4_0, to stduhpf Q4_0_S, to Bartowski Q4_0 now.