HN Mirror


by nyrikki 126 days ago

It is crazy to me that it is that slow. 4-bit quants don't lose much with Qwen3 Coder Next, and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps on a 3090 (24 GB) in a VM, with a 256k context size, using llama.cpp.

Same setup with unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps, and gpt-oss-20b gets 195 tps!

The advantage, IMHO, is that you can boot from the APU, pass the GPU through to a VM, and get nice, safer VMs for agents at the same time, all while using DDR4.


lambda 125 days ago

Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to tweak performance; this is just the first setup I've gotten that works.


nyrikki 125 days ago

The memory bandwidth of the laptop CPU is better for fine-tuning, but MoE really works well for inference.

I won't use a public model for my secret sauce; there's no reason to help the foundation models with it.

Even an old 1080 Ti works well for FIM (fill-in-the-middle) completion in IDEs.

IMHO the above setup works well for boilerplate, and even the SOTA models fail on the domain-specific portions.

While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude Code locally churn on the boring parts while you work on the hard parts.


lambda 125 days ago

Yeah, I have a lot of problems with the idea of handing our ability to write code over to a few big Silicon Valley companies, and I also have privacy concerns, environmental concerns, etc., so I've refused to touch any agentic coding until I could run open-weights models locally.

I'm still not sold on the idea, but this allows me to experiment with it fully locally, without paying rent to some companies I find quite questionable. I can know exactly how much power I'm drawing, and the money is already spent; I'm not spending hundreds a month on a subscription.

And yes, the Strix Halo isn't the only way to run models locally for a relatively affordable price; it's just the one I happened to pick, mostly because I already needed a new laptop, and that 128 GiB of unified RAM is pretty nice even when I'm not using most of it for a model.
