| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by troad 79 days ago
	I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!) There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)

4 comments

theshrike79 78 days ago

Qwen3.5 has tool calling, so you can give it a wikipedia tool which it uses to know what happened in Tiananmen Square without issues =)

link

troad 78 days ago

That's very cool! I think giving it some research tools might be a nifty thing to try next. This is a fairly new area for me, so pointers or suggestions are welcome, even basic ones. :)

Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!

link

theshrike79 78 days ago

Basically ask any coding agent to create you a simple tool-calling harness for a local model and it'll most likely one-shot it.

Getting the local weather using a free API like met.no is a good first tool to use.

Thanks!

I'd recommend it too, because the knowledge cutoff of all the open weight Chinese models (M2.7, Qwen3.5, GLM-5 etc) is earlier than you'd think, so giving it web search (I use `ddgr` with a skill) helps a surprising amount

link

theshrike79 78 days ago

Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.

link

zozbot234 78 days ago

> Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.

link

spockz 78 days ago

Ive always wondered where the inflection point lies between on the one hand trying to train the model on all kinds of data such as Wikipedia/encyclopedia, versus in the system prompt pointing to your local versions of those data sources, perhaps even through a search like api/tool.

Is there already some research or experimentation done into this area?

link

zozbot234 78 days ago

The training gives you a very lossy version of the original data (the smaller the model, the lossier it is; very small models will ultimately output gibberish and word salad that only loosely makes some sort of sense) but it's the right format for generalization. So you actually want both, they're highly complementary.

link

theshrike79 78 days ago

That's the key, it just needs to be smart enough to 1) know it doesn't know and 2) "know a guy" as they say =) (call a tool for the exact information)

Picking a model that's juuust smart enough to know it doesn't know is the key.

link

whackernews 79 days ago

Oh does llama.cpp use MLX or whatever? I had this question, wonder if you know? A search suggests it doesn’t but I don’t really understand.

link

irusensei 78 days ago

>Oh does llama.cpp use MLX or whatever?

No. It runs on MacOS but uses Metal instead of MLX.

link

zozbot234 78 days ago

ANE-powered inference (at least for prefill, which is a key bottleneck on pre-M5 platforms) is also in the works, per https://github.com/ggml-org/llama.cpp/issues/10453#issuecomm...

link

OkGoDoIt 78 days ago

Is that better or worse?

link

irusensei 78 days ago

Depends.

MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.

So its kinda like having a very specific diet that you swear is better for you but you can only order food from a few restaurants.

link

drob518 78 days ago

But you can always fall back to GGUF while waiting for the world to build a few more MLX restaurants. Or something like that; the analogy is a bit stretched.

link

irusensei 78 days ago

Yeah I'm terrible with analogies.

link

LoganDark 78 days ago

llama.cpp uses GGML which uses Metal directly.

link

austinthetaco 78 days ago

Have you played around with any of the Hermes models? they are supposed to be one of the best at non-refusal while keeping sane.

link

troad 78 days ago

Interesting! Unfortunately, the smallest Hermes 4 model I can see is 14B, which would really strain the limits of my little laptop. The only way I might get acceptable performance would be to run it extremely quantised, but then I probably wouldn't see much improvement over the 9B Qwen.

link

WesolyKubeczek 78 days ago

Cool, I always wanted to invade Belgium. Maybe if my plan is good, I could run a successful gofundme?

link

troad 78 days ago

Hey, if Margaret Thatcher's son can give it a go, why not you? Believe in yourself and reach for those dreams. *sparkle emoji*

link