Hacker News new | ask | show | jobs
by gcr 22 days ago
here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.

  /Users/gcr/llama.cpp/build/bin/llama-server
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
      --no-mmproj-offload
      --fit on
      -c 65536 # edit to taste
      --reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.

For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.

You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).

Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc

Take backups and then go have fun. Hope this helps.

3 comments

I have a 5070TI (16gb VRAM) with 32GB system ram and a 16 core AMD cpu. I am considering buying a second used videocard, probably the same model, but not for months yet. This hardware setup is new-for-me in that a buddy gave me most of it and I bought the TI card.

Are there any resources to help me figure out how to best optimize my runtime paramaters for a given model, based on a given task, similar to what you've shown?

I've been a little... irritated? that hooking vscode up to my company LLM subscription seems so much more out-of-the-box capiable than what I can get to work. My assumption at the moment is that I need to create a lot of... I think they're called harnesses? agents? workflows? integrations? (not sure) by hand. Is that accurate?

Right now I have ollama running an nvidia nano model and I can poke it with a stick over a web interface I installed. It works, initial token response is slow, after that it seems fine enough.

I can't seem to get a good handle on how much context I've used, when context usage starts to degrade response accuracy, or in general how to mirror the results I get (not in terms of accuracy or speed, just features) from the company github copilot + vscode integration.

I was also trying to get a plugin called qodeassist working via qtcreator, mixed results there as well.

I've been keeping up with this space since the jump, never paid for a sub, work gave me a sub a handful of weeks ago, so the actual useage is all new to me.

I can't say I'm super impressed with any of it relative to the hype, but I found it neat to be able to point vscode at a c++ codebase and say "enable wextra, build the code, tell me if there is any low-hanging fruit I can clean up" and get a useful response.

I also asked my local model to turn a picture of my dog into a picture of an otter, got a blank picture back, which the thinking bit told me it would do. The whole thing was actually kind of funny. "I am allowed to edit pictures, I can't edit pictures, I am allowed to edit pictures, I'll tell the user I did and send a blank picture back because I can't edit pictures, but I am allowed to."

Can you elaborate more on the differences in running ollama or lmstudio? Do they actually slow down the speed of the inference and if so why? Or is it just a preference thing?
Ollama and LM-Studio are fine. Their main advantage is that they have a nice way to browse models -- LMStudio from huggingface and Ollama from their own curated list. Both are great ways of getting started. Pick LM-Studio if you'd like a nice GUI frontend to mlx-lm or llama-cpp; pick ollama if you'd like a nice command line interface and don't need non-default parameters.

LM-Studio doesn't support certain parameter combinations. For instance, LM-Studio supports KV quantization....but if you're using the MLX backend, you can't set the context length when KV quantization is used? Why? Running a model with certain settings requires keeping a little SAT solver going in your head. I found that overwhelming, so I just stopped using it.

The Ollama devs want to offer a central curated experience, but I perceive their approach as "playing fast and loose." They've re-implemented unique code for every model they support in their own Go runtime, so certain parameter choices aren't supported. On my hardware, their MLX backend just doesn't work at all without segfaulting the server process for example. It doesn't smack as vibe coded the way oMLX does, but it also doesn't smack as professional or battle-tested.

Ultimately, just dropping down to llama-cpp's GGUF model support and asking for default settings has provided faster inference speeds than anything I've been able to benchmark with them, but everything's within 10% of each other anyway so it's not a huge deal for me.

Thank you, that makes a lot of sense
Thanks a million!