| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by c0rruptbytes 5 hours ago

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for

18 comments

saghm 4 hours ago

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

spockz 3 hours ago

For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

echelon 2 hours ago

I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

zozbot234 1 hour ago

H200s and other enterprise datacenter GPUs are completely overkill in any realistic single- or few-users inference scenario. They're hugely unbalanced towards compute capacity which will go almost entirely unused (i.e. wasted) unless you're running huge batches on a continued basis. I've argued many times that local inference engines should support batched inference on a somewhat smaller scale for a variety of reasons (especially given the unexpected effectiveness of SSD streamed inference with larger-than-RAM models), but even I don't think we can realistically go to 300x or so for real-time inference, which is the range that pencils out quite consistently from a simple roofline model of these datacenter cards.

echelon 57 minutes ago

If you're doing professional work in coding or video, you can easily saturate a single H200.

This is what RunPod-type services are for.

For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.

I can rent an H200 for $3.50 an hour. That's INSANELY cheap.

I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.

The ideal solution is models we own run on RunPods leveraging H200s.

I can spend $100-200/day on compute making much more value with the model outputs.

----

edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.

You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.

spockz 40 minutes ago

Sure, to approach frontier model quality locally we need to have more power. And H200s are a way to get there.

However, we need to use the tools that we have. Even if I wanted to buy a (bunch of) H200 for me and my colleagues and could get the expense approved, they are hard to source where we are.

Yes. You can rent them, but I’m not sure how that affects the IP discussion.

Moreover, not everyone is doing coding and video so we have different tasks that can fit quite well on relatively light laptops (Gemma et al), for relatively directed coding sessions we can make do with RTX cards, or a small step up, all the way to H200 in the workstation. Or pods thereof.

We have the graphics cards and laptops with MLX right now. The H200 will take a year at least to arrive. Better get used to run stuff locally.

zozbot234 35 minutes ago

I'll definitely believe that for video generation models, but those are also very compute-intensive for rather middling results.

SR2Z 1 hour ago

That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.

MrLeap 1 hour ago

There's a lot more professionals that have RTX cards than H200s. You're inevitably see more development and experimentation on things actual humans have lmao.

rapind 3 hours ago

> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

rlkf 39 minutes ago

> Basically I want Hetzner and OVH to run open model clouds

You can run Qwen3 on OVH already:

<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>

saghm 2 hours ago

I'm probably somewhat adjacent to you. I would be happy to pay, but I just don't want to pay any of the companies that are actually offering things right now. I had the $20/month sub for Claude for a couple months, until one day I kept inexplicably getting errors saying I hit the limit even though their site showed my usage at less than half for the session and 8% for the week, and it seemed silly to pay for something that couldn't even properly respect its own measurements. OpenAI sketches me out too much as a company, Cursor feels lackluster when I use it for work from the account they pay for (and now is getting acquired by maybe the only AI company even sketchier than OpenAI), and I wasn't particularly impressed with Gemini or Mistral Vibe either when I tried them on the free tiers either.

rapind 2 hours ago

I was paying around $500 / month on average between multiple providers for over a year. I cancelled one a while ago because of pretty bad service availability (Bet you guess who that is!), which by all reports hasn't improved much.

For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).

gaolei8888 1 hour ago

who?

darkmarmot 3 hours ago

Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.

rapind 3 hours ago

Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.

There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.

pessimizer 2 hours ago

> If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.

I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.

Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.

aamoscodes 3 hours ago

You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash

rapind 3 hours ago

Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.

Bnjoroge 3 hours ago

You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.

bel8 3 hours ago

These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

yencabulator 3 hours ago

MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.

redmalang 3 hours ago

Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.

ryukoposting 3 hours ago

I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.

saghm 5 minutes ago

I did try this, and it was pretty hit-or-miss still. I even went as far as configuring context for Zed to inject into all conversations saying stuff like "If you need to read a file, call read_file NOW. Do not say you will read it", and it still didn't really make a huge difference.

aftbit 4 hours ago

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

ryan_glass 2 hours ago

For a fraction of the price of 96GB vram, I built a desktop based on a supermicro server mobo and EPYC 9 series CPU, with just under 400GB rdimm ram (approx $4500 all in but this was before the ram price hike). Works really well for serving larger local modals at a decent enough speed (I consider anything more than 10 tokens/second usable and value accuracy over speed).

dofm 3 hours ago

FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.

EagnaIonat 3 hours ago

Depends what you need the model to do. The recent granite4.1:3b just takes 2GB of memory and is fast. Results are pretty good and support tool calling. Barely a squeak out of the Mac laptop.

Even faster with the MLX builds.

Then when I need more heavy lifting I fire up a larger model.

IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.

jtbaker 3 hours ago

> Trying to run them on a unified memory Mac

> but still not quite in the realm of Sonnet or DeepSeek 4 Flash

these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4

trueno 1 hour ago

someone just put this on my radar yesterday, im about to try this today. how's your experience with it?

me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)

jtbaker 1 hour ago

I'm on the verge of cancelling my anthropic $20 plan since it's come out. On an M5 Max 128GB, hooked up to the pi.dev harness, I get in the neighborhood of 400-450tps prefill and 30-35tps generation. It is imminently usable and at times feels more stable than my previous CC setup. Occasionally there are things it struggles with that I will bounce back over to CC for, but it is highly usable. The future is bright for local models! As a tinkerer, it makes me really happy to have a local setup I can be just as productive in, and not have the token overlords ready to shut me down at any time.

aftbit 1 hour ago

That's DS4 Flash right? How does it feel in intelligence and speed compared to DS4 Flash hosted by Deepseek themselves or another API provider? I've been using API DS4 Flash for a lot of personal projects and have been quite impressed. I've spent $1 on building ~10 toy projects and gotten them all to work within the bounds of what I wanted without having to do much besides guide the model away from dumb loops.

jtbaker 51 minutes ago

I'm using the DS4 flash IQ2 2-bit quant, per Salvadore's recommendations for my hardware in the repo. I haven't messed with the cloud hosted variant. The only other paid API I have messed with is a $20 Anthropic sub, primarily with whatever the latest version of Sonnet is. For the most part, this local configuration feels on par with that.

With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.

wincy 3 hours ago

If I could just save up $6000 I could sell off my RTX 5090 for $4,000 and buy an RTX 6000 Blackwell Pro Workstation. I can fit models into the 32GB of vram but my context window ends up being tiny for any halfway capable model.

layer8 2 hours ago

Isn’t the RTX 6000 Blackwell Pro Workstation over $13000 now?

monksy 1 hour ago

That RTX6000Pro you mentioned is $12k.

aftbit 1 hour ago

Yep - I'd say either that or 4x 5090 is a great entry point to running local models "well". Two of them would be even better. If you don't have $12-24k to spend, you can try your hand with tiny models or quants or slow speeds, but it will be a much more painful experience. You're already giving up a lot by dropping down from frontier models - you're giving up even more by trying to squeeze them into little RAM and compute.

Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.

eek2121 4 hours ago

Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.

mathisfun123 4 hours ago

can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model

zozbot234 5 hours ago

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

greenavocado 4 hours ago

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

themanualstates 4 hours ago

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

halJordan 2 hours ago

The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what alibaba recommends. So you just need their model card. I would not call it highly optimized, more appropriately tuned.

greenavocado 1 hour ago

I found every possible flag and its description including CUDA related environment variables and went back and iterated with Claude Opus 4.8 High until every single flag mattered above the temp one.

nateb2022 4 hours ago

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pro's. LM Studio + MLX.

Terretta 3 hours ago

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.

ridiculous_leke 3 hours ago

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

mattmanser 4 hours ago

That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.

stymaar 3 hours ago

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubish” has no idea what they are talking about.

embedding-shape 3 hours ago

I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.

c0rruptbytes 3 hours ago

q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows

greenavocado 3 hours ago

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

greenavocado 3 hours ago

I typically find myself using a context of between 150-500k with GPT models so local models are simply not enough and I stopped using them.

stymaar 3 hours ago

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view), why are you doing that?

greenavocado 3 hours ago

You're 100% right and its even severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting let's say 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact and while its better than normal compaction it still lobotomizes the model more than running with a very high context window.

c0rruptbytes 3 hours ago

large contexts degrade the performance - attention doesn't work will for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get it okay, but it's not that rough

adam_arthur 4 hours ago

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

dstryr 4 hours ago

This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

adam_arthur 4 hours ago

I'm talking about automation generally, not agent loops.

E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.

Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).

Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.

Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)

Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)

Applies to other rule following as well in my experience.

Qwen may be better at toolcalling and certainly probably codegen.

It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.

trouve_search 4 hours ago

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

adam_arthur 4 hours ago

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.

ozim 2 hours ago

I was expecting DGX Spark to run Gemma 31b Q4 much faster.

I was expecting it would run Q8 in 50 tok/s.

I guess that’s good I stopped thinking about buying it because I would be disappointed.

gopher_space 4 hours ago

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.

msp26 3 hours ago

Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.

freehorse 1 hour ago

> You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.

I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.

c0rruptbytes 1 hour ago

I would try a 6-bit MoE and maybe with unsloth's studio, they claim to have auto tool fixing which is where i see a lot of issues with MoEs

I'm on a 48gb M5 Pro right now and it's been okay, a lot of my rough experiences have been with MLX and I'm finding that GGUFs are okay now

smcleod 14 minutes ago

Those dense models are pretty fast with MTP now. 40-70TK/s depending on your machine, that's faster than cloud models (although not as smart obviously).

Stagnant 1 hour ago

I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs.

lmedinas 9 minutes ago

the problem of that setup is that it will run out of context pretty quick. So for coding agent it will limit your workflow very fast.

amdivia 1 hour ago

110+ Tok/s as another data point on the RTX 5090 (Gemma 4 31B QAT + MTP at UD-Q4_K_XL) (at peak used 27 GB of vram)

The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)

hnlmorg 2 hours ago

To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejected code from OpenAI models than I have approving it.

In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)

heipei 5 hours ago

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

jstanley 5 hours ago

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.

throwawayffffas 3 hours ago

Qwen 3.6 35b a3b is about as good as sonnet 4.5. It varies but it's at that level.

lelanthran 2 hours ago

> But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

Well, you aren't going to give it a 20k line sec and have it churn out a full app after 4 hours hours.

But, you can get it to write code for you if you do the design.

notnullorvoid 4 hours ago

Quantized Gemma 4 26B is as smart or better than GPT 5 in most of my testing. Granted GPT 5 is nearly a year old at this point, but I can run Gemma 4 on a ~6 year old consumer GPU (RTX 3090) and get 140 t/s.

heipei 4 hours ago

It is smart enough that I use for all my coding tasks, and a lot of other mundane tasks.

It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.

My suggestion is to simply try it and see what it feels like.

myaccountonhn 4 hours ago

Its not going to be as good as Claude, but if you know what you're doing, it may be good enough to get your work done.

data-ottawa 4 hours ago

This is task dependent.

I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.

garciasn 4 hours ago

A highly skilled carpenter may be able to 'get work done' by banging nails in with a heavy-bottomed cocktail glass, doesn't mean it's not painful to do so when it is continuously breaking and leaving shards of glass all over the workshop for you to find every day for the rest of your life until you clean up the mess you made using the wrong tool for the job.

sgt101 2 hours ago

If someone comes into the workshop and takes all the tools (hello Donald) then having a cocktail glass to hand might be a bit of a lucky break.

(geddit?)

CamperBob2 4 hours ago

More like, a highly-skilled carpenter can work miracles with a $6 hammer from the hardware store, while the pros on the commercial crew are using fancy compressed-air tools.

The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.

c0rruptbytes 3 hours ago

I'm talking about the common use case that I think hacker news people have:

you get a macbook for work, you run the macbook

they're not going to start giving GPUs to employees to run local models

robomartin 1 hour ago

> On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

Laptop?

OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.

Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.

ridiculous_leke 3 hours ago

A median laptop is no bueno for running a reliable model(which will be qwen 27b as per my reading here and r/localllama). Powerful macs would be prevalent in certain areas of the world but in rest of the world personal machines aren't always that powerful.

FuriouslyAdrift 3 hours ago

Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"

Our GPU computer server cost $110k.

atomicnumber3 3 hours ago

I largely don't disagree with you but come to a different conclusion. I have two systems:

1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram

2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.

I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.

The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.

I find with both models that:

- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"

- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)

- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed

- they both cannot really be given a large ish task and left to just drive it on their own

The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.

I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.

So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.

everdrive 4 hours ago

What counts as a lot of memory? What could someone do with 16 GB of RAM?

throwawayffffas 3 hours ago

Not much, the capable models won't fit unless you go with very low quantization but that leads to a lot of loss.

You generally want to run q8 or some kind of "6bit" quantization at least.

40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.

Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.

zozbot234 4 hours ago

Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)

abalashov 4 hours ago

Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.

throwawayffffas 3 hours ago

Nah, you can run the 24b - 35b class with between 90k and 256k of context with about 40GB and they are pretty good. Especially the MOE variants fit neatly in 40GB.

ValdikSS 4 hours ago

Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.

trouve_search 4 hours ago

gemma 12B 4bit quant; try something with MTP and an AWQ quant

monegator 4 hours ago

gemma runs pretty well

greenavocado 5 hours ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

iwontberude 4 hours ago

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

dominotw 4 hours ago

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it usecases like that latter and they are fine.

citizenpaul 2 hours ago

They are still terrible at tool usage which loses 99% of the effectiveness of the agent. I've had to concede and use paid frontier models that can use tools or its not worth using agents....copy...paste....copy....paste....