Running local models is good now | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Running local models is good now (vickiboykis.com)
	231 points by jfb 1 hour ago

28 comments

c0rruptbytes 10 minutes ago

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

zozbot234 3 minutes ago

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

heipei 4 minutes ago

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

jstanley 2 minutes ago

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.

greenavocado 3 minutes ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

hypfer 53 minutes ago

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

ggerganov 19 minutes ago

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

celrod 2 minutes ago

What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.

ltononro 1 minute ago

Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)

StevenWaterman 45 minutes ago

Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

QuantumNoodle 4 minutes ago

Do you have any resources on hardware necessary for running models and tweaks? I see you mention 2x 3090 and I wanted to do more search on what hardware is satisfactory for what models.

indoordin0saur 30 minutes ago

> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.

amoshebb 28 minutes ago

This is, as far as I know, the business model of coys like mistral and cohere

suncemoje 9 minutes ago

On-premise (1960-2010) -> Cloud (2010-2026) -> On-premise (2026+)?

cyanydeez 9 minutes ago

I think the next step to anyone but overbloated USA models is to follow https://chatjimmy.ai/ with one of the qwen models. If they can mass produce something at relative cost, these would be awesome sidecars.

hughw 13 minutes ago

Just this morning I tweaked my single 3090 setup too:

  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000

and that fits in 23GB.

[edited for format]

giancarlostoro 38 minutes ago

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.

StevenWaterman 35 minutes ago

I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.

I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know

Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.

In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)

cyanydeez 2 minutes ago

I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.

I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.

Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.

Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.

epistasis 18 minutes ago

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

illegalsmile 1 minute ago

That's why you have to give claude and others directives/.md at the beginning so it doesn't go off the deep end with suggestions.

bityard 3 minutes ago

I'm not sure to what degree you can influence how a model thinks, but you can definitely hide the thinking tokens and tell the model how you want it to talk to you.

For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.

Local agents/frameworks/whatever all have their equivalents for overall user preferences.

radium3d 32 minutes ago

If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center.

That said, it does make it possible to train the models having them in the same data center. Having them distributed to everyone would slow down training considerably.

kitd 46 minutes ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

MostlyStable 45 minutes ago

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

clickety_clack 41 minutes ago

I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.

MostlyStable 11 minutes ago

The person I was replying to specifically said that the Claude will "encode more knowledge" and that their problem was that they didn't like talking to Claude. It sounds like they think that Claude is at least slightly more functional. And the "not liking talking to it" is probably fixable. Someone for whom a local model works, and for whom the economics make sense, should absolutely run a local model and I wouldn't try to convince them otherwise. I'm sure it's the right choice for a lot of people. But not liking the personality of Claude is probably not a great reason on its own, given the minuscule amount of effort it takes to fix.

Scoundreller 27 minutes ago

The third category are the occasional users that won’t have the hardware and won’t stomach a monthly fee for “unlimited” but are happy to pay-per-use.

I’d think the volume for that category would be low but LLMs aren’t just for coding.

dghlsakjg just now

I’m probably the third category. I like experimenting and trying different models and techniques. I want api access for my own apps and Claude subscriptions don’t have that.

Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.

Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.

My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.

If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.

chrisweekly 32 minutes ago

Not everyone has the right hardware.

clickety_clack 18 minutes ago

I guess I’m thinking of the $100/mo users, for whom it’s probably possible to get the right hardware.

derethanhausen 44 minutes ago

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

hypfer 42 minutes ago

Opus in my experience is equally unpleasant "character"-wise, but at least it actually gets stuff done more often, so it's at least slightly more earned at that. It's still a neurotic cargo-culting dogmatic idiot, but one that at least sometimes does produce deliverables instead of only bottom-tier HN-esque opinions.

Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.

giancarlostoro 40 minutes ago

There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?

StevenWaterman 32 minutes ago

You read the OP backwards, they said Sonnet is a downgrade from Qwen, and prefer Qwen's tone

giancarlostoro 2 minutes ago

Sure, but my argument still holds, the idea is that Qwen reasons the way that Opus on High (what is now Max or whatever?) level thinking to reason about problems instead of its standard approach.

whythismatters 31 minutes ago

Yes, Qwopus :) I've been pleasantly surprised by its quality

giancarlostoro 1 minute ago

Seen that one too, same guy I'm thinking of too, havent had a chance to try all of their models. For anyone curious I believe the username is Jackrong on huggingface? They've got several models out on there each focused on programming from different approaches.

dackdel 18 minutes ago

what kind of hardware do you need in order to run qwen3.6-27b

iagooar 13 minutes ago

I recommend MacBook M5 Max with 128 GB of RAM to run it comfortably and fast. If you have something like a regular M4, go with qwen3.6-35b-a3d - the Mixture of Expert architecture makes it run 2-3x faster than the 27b version.

sbmthakur 7 minutes ago

I could run it on 7900 XT with 64k context. You could run it more comfortably on a 24 gb vram.

swatcoder 33 minutes ago

Using the first-party Claude Code SKILLS as a signal for what Anthropic is training Claude itself to be good at generally, my sense is that they're crafting a tool that's very good at reproducing cargo cult practices that are only suited for certain use cases but that then become the default which many naive users will misapply and many savvy users have to swim upstream against. Yet those practices are suitable for some projects and so some can end up with genuinely impressive examples to showcase.

Ultimately, the whole concept of a singular right way to do things in our craft is absurd, and the popular/cargo-cult "best practices" of the 2020's era were already pretty questionable before they started getting burnt into everybody's favorite $1T helper as its default preference.

Other models may struggle to acheive parity on some of Claude's best fit showcase examples and some benchmarks, but their presumably weaker training may prove advantangeous as a more agile starting point provided their general capabilities prove strong enough to write sound code.

indoordin0saur 36 minutes ago

Very curious what hardware you're running this on!

hypfer 32 minutes ago

The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.

Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.

Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/

indoordin0saur 23 minutes ago

Nice! Do you do anything with that compute when you're not actively using it? Is the crypto-mining hobby still worth it? I've also wondered if such expensive hardware can be rented back out to offset cost. Looks like these cards are going for as much as $4k nowadays.

all2 18 minutes ago

There are services where you can hook your card up and rent it out to other users. I don't know what any of them are called, but they do exist.

hypfer 21 minutes ago

I've paid ~2k€ in 2023. Since I'm usually sitting next to it, I'm only using it when I want to use it. It can get quite loud and warm.

Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.

esseph 12 minutes ago

> I've also wondered if such expensive hardware can be rented back out to offset cost.

Massive legal liability. Not worth it.

cdelsolar 25 minutes ago

What did you call me?

chrisweekly 33 minutes ago

Why Sonnet 4.6 not Opus?

iagooar 4 minutes ago

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.

I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.

What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").

Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).

Barbing 1 minute ago

Did you get a Brave search API key or something for that “Hermes”?

rmunn 56 minutes ago

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

sathackr 42 minutes ago

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single lf the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

TkTech 3 minutes ago

For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.

Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.

preommr 20 minutes ago

> The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.

AI is different.

Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.

In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.

The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.

And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.

dreambuffer 26 minutes ago

It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.

But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.

cheema33 27 minutes ago

> I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.

I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.

derfurth 30 minutes ago

That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.

sathackr 12 minutes ago

The hardware, the power systems, the cooling systems. They need maintenance.

The OS needs updates, file systems get corrupted.

Fans get dirty.

All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)

davidw 14 minutes ago

Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?

sbmthakur 1 minute ago

[delayed]

wuliwong 38 minutes ago

These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

icoder 41 minutes ago

What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.

bluGill 19 minutes ago

What it costs is tricky to measure. A large part of the costs are training the model. Once they have the model they are making a ton of profit from what they charge (or so we think - I haven't seen the numbers). However the sunk costs of getting the model need to be paid for and that means an accounting problem where we have to guess how much the model will be used in the future.

Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.

esailija 35 minutes ago

Bigger models that Antrophic want to sell cost disproportionately more (e.g. 100% more cost for 5% performance improvement) than small models you would use locally

indoordin0saur 44 minutes ago

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

themaninthedark 54 minutes ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

otterdude 51 minutes ago

Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.

Tharre 5 minutes ago

I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.

But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.

I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.

embedding-shape 55 minutes ago

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

zozbot234 33 minutes ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

embedding-shape 18 minutes ago

As mentioned, I've just finished the implementation and started playing around with it, seems to be doing similarly well inside of my own agent harness as similarly sized "traditional" LLMs. Of course, neither come close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels like it's too good to quickly ditch diffusion. Do you have more info what those "can't be trained beyond low/mid size" issues are in practice today?

zozbot234 8 minutes ago

The issues around training diffusion models are well known among researchers. They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally-sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.

ltononro 6 minutes ago

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

segmondy 12 minutes ago

It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

sosodev 41 minutes ago

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

chrismarlow9 40 minutes ago

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

daniban 5 minutes ago

With Apple silicon and now the RTX Spark there are real discussions whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvdia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

_doctor_love 1 hour ago

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

swatcoder 52 minutes ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

techscruggs 55 minutes ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

Shekelphile 36 minutes ago

She

psychoslave 42 minutes ago

Global Affordability Estimate:

Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.

Top 25% (~2B people) could afford it with some budget adjustments.

Bottom 50% (~4B people) would find it prohibitively expensive.

So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.

disgruntledphd2 29 minutes ago

The maths changes if you're working for yourself. Because I live in Europe, I've ended up working as a contractor due to the lack of a legal entity in my country. While that mostly sucked for a bunch of reasons, I was able to get a 64Gb Mac M2 a few years back with approximately a 52% discount, which was kinda nice.

weego 23 minutes ago

If you're working for yourself paying monthly is exactly the same as amortising an asset. Personally I'd rather my business just pay $100 a month than have to deal with additional hardware and software maintenance while using a depreciating asset that is break-even after 3-5 years depending on the spec.

richwater 35 minutes ago

Yes, because the bottom 50%, mostly impoverished or near impoverished folks were spending money on Claude Code subscriptions instead /s

amalcon 50 minutes ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

AbsurdCensor 39 minutes ago

At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.

amalcon 35 minutes ago

Yeah, they have gone up a lot since I bought mine too. I did get Qwen3.5-122b running on all-GPU (on a 128GB machine) under a minimal Arch Linux setup (I do my GUI work on a much cheaper box). It worked, but Qwen3.6-35b is performing almost as well and a lot faster.

Still cheaper than a new Mac. Maybe not cheaper than a used one.

p-e-w 53 minutes ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

tjwebbnorfolk 56 minutes ago

AI and budgets don't mix well at the moment

themythfable 54 minutes ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

embedding-shape 52 minutes ago

> Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

There are segments, everything from "Average person in world" to "Average creative professional using computers for work", with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.

sublinear 32 minutes ago

If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.

It's just for gaming and AI now. Maybe not even gaming as much anymore.

Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.

anarticle 37 minutes ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

fridder 13 minutes ago

Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point

0xc0c0c0 32 minutes ago

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

failbuffer 21 minutes ago

So which harness did you end up choosing?

anubhav200 31 minutes ago

I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)

aliljet 34 minutes ago

The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...

wxw 51 minutes ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

richbradshaw 57 minutes ago

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw 38 minutes ago

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor 30 minutes ago

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

cautiouscat 46 minutes ago

> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

The good old butt dyno!

I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

simonw 40 minutes ago

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly 27 minutes ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

simonw 15 minutes ago

I'm still amazed that you can run LLMs of this quality on a machine that costs less than $3,000.

I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.

fg137 15 minutes ago

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

orf 11 minutes ago

99% of the population don’t code using models, local or remote. So that’s a useless metric.

What % of developers could afford an older MacBook model, second hand? Far, far more than 1%.

cube00 54 minutes ago

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

glaslong 13 minutes ago

Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.

anax32 58 minutes ago

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

ibizaman 38 minutes ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

stared 39 minutes ago

I really recommend Qwen3.6 27B.

Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...

When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.

iagooar 12 minutes ago

I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.

wizzledonker 30 minutes ago

Did you mean 2025?

stared 25 minutes ago

Yes, fixed

wasimxyz 31 minutes ago

https://canirun.ai

xienze 43 minutes ago

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.

monegator 9 minutes ago

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)

Wish i had 3 times the RAM so i can see what happens with more context.

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..