Hacker News
by root_axis 46 days ago
You are greatly underestimating the hardware requirements for productive local LLMs. Research consistently shows that parameter count sets the practical ceiling for a model's reliability. Quantized models with double digit param counts will never be reliable enough to achieve results in the realm of something like Opus 4.6.
8 comments

Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails and then having Claude do it, only to come away with something perfectly usable from the local model.
Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/
Benchmarks only give you the roughest idea of how models compare in real world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week, Opus 4.7 is definitely better, but for many things they are roughly equivalent, there are some things on the edge that are just up to chance, and either model may stumble across the solution, and there are some areas of my work that reliably trip up both models and I get better mileage out of writing code the old fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here, I'm doing DSPs and weird color math in shaders, I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another.
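To put a rough number on "measuring across many trials" versus n=1, here is a minimal sketch that turns pass/fail tallies into a 95% Wilson score interval. The tallies below are made up for illustration and are not measurements from anyone in this thread; the function names are my own.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical tallies: tasks completed without intervention out of 40 attempts.
local = wilson_interval(successes=31, trials=40)
hosted = wilson_interval(successes=35, trials=40)
print("local :", local)
print("hosted:", hosted)
print("overlap:", intervals_overlap(local, hosted))

# A single success tells you very little about the true pass rate.
print("n=1   :", wilson_interval(successes=1, trials=1))
```

Overlapping intervals do not prove the two models are equivalent; they only say this many trials can't separate them.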

FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet, idfk, maybe it's a skill issue but I love Opus 4.7, undisputed king, but Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.

"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.
Sorry, "essentially useless in the context of local model availability". It's a fine model but its tier of inference is fully fungible.
I’m building a pipeline and testing against gemma4 and Gemini’s 3-1 flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly almost always.

But they diverge greatly on particular tasks where the ViT tower and a priori knowledge of the world are crucial. I wish Gemma were on par, but both Google and I know it's not.

You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless.
I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought".
I’m guessing Qwen3.6 for agentic coding and Gemma4 for non-coding stuff?
No, exactly the opposite actually. Qwen3.6 is too imprecise for long-running agentic tasks. It doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in VRAM by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled, but for anything where I don't want to have to intervene too much it can't be trusted.
Oh interesting. I've read that Gemma 4 is really good for creative stuff, but I'm mostly interested in agentic coding. Unfortunately, each time I use Gemma 4, I just get it stuck in loops.
This is probably a precision thing, I think there's a really big difference in long running tasks between q4 and q6.
Ok, you’ve given me the oomph to try again. Thanks!
What harness are you using?

I'm going to switch to local LLMs for most stuff soon.

Overall using screentime as the metric, derived from some imperfect logging and vibes it's about 50% OpenCode 15% Continue 15% my homebrew bullshit 13% Claude Code and 7% Cline. I've been deep on agentic stuff lately (1.3wks aka 3 months of AI time), there are only so many hours in the day to duplicate work and AB test, but in the past I've sworn by Qwen Coder + llama.vim and I still enjoy that workflow for deep work far more than I like prompting agents, but there's a lot of dross I'm learning to delegate.
Interesting.

I stopped doing local stuff for a bit when I realised I didn't know how well it is supposed to work so have been on Claude for a few months now.

I think I'll try OpenCode this time.

Usually I do stuff in devcontainers, qwen code (non local) was the only time I managed to lose some work as it got confused when I ran out of tokens.

There's still quite a way to go - it does seem like Claude code itself is pretty badly coded, so I think there is a space for open source to come in with a high quality harness at some point.

Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.
False. The absolute capability is irrelevant; with the proper harness 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdos Problems, it's how reliably it can remove drudgery in my life. It just autonomously reverse engineered a Bluetooth protocol with minimal intervention; its ability to react to data and ground itself is constantly impressive to me. I do a ton of testing with these models, and today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where their capabilities are both good enough is very surprising. The tasks I have that stump Gemma often also stump Opus 4.7.
Maybe reaching for an analogy would be helpful here.

Thot_experiment is saying that his 2016 Toyota Prius is a great and reliable car for his daily commute and running errands.

Whereas everyone is screeching about its capability gap with a Lockheed Martin F-35 Lightning.

Yeah, thanks, though I think local models are at least a Cessna, which while being nothing like an F-35 can fly.
Flying is fun. But shooting Cessnas out of the sky is more fun!

I'm kidding around. I run 31b models myself too and am perfectly happy with them.

This is like saying that 640kB is enough for anybody.
No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases.

(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less for example, the delta between a system with 640kb of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, the hacker news UX could also be roughly the same)

> 640kB is fine

How refreshing to hear this kind of old-school hacker thinking, in a thread where most people have given up on local computing in exchange for convenience and permanent third-party dependency.

With embedded systems affordable and ubiquitous, hopefully a growing segment of the new generation will also learn to push the limit of available hardware and see how far we can take it. As an engineer there's a satisfaction in solving things with what you got.

There's a new technique, 1-bit family of language models that can achieve up to 9x memory efficiency compared to existing models. Still multiple gigabytes for practical use I imagine, but it's great progress toward local AI, which I believe will be common in the near future. https://prismml.com/news/ternary-bonsai

It's more like saying "HIMEM.SYS is not much better than 640kB".
It would be true, if model providers did not throttle their models. I do not have definitive proof they do but the rumors are abundant.
I think you are missing the point here. what matters is for that user the local models are good enough for their use case.
Joke's on you. We are already running Deepseekv4Flash, Mimo2.5, MiniMax2.7, Qwen3-397B locally on very affordable hardware. These models are in the realm of Opus4.6. For those of us a bit crazy, we are running KimiK2.6, GLM5.1 and more ...
I have two A100s and have been playing with local models for years. There are definitely moments where they are quite impressive, but small context sizes and unreliability become immediately obvious.

> For those of us a bit crazy, we are running KimiK2.6, GLM5.1

Yes, those can compare to Opus, but you can't run those unquantized for less than $400k in hardware.

Two Mac Studio M3 Ultra 512GB and 1 USB cable can run all those models - maybe about $30,000 in hardware - and based on my benchmarks, those Mac Studios were twice as fast as the A100s on Deepseek v4 Flash, which has a quantization but not really a lossy one.
That cannot run KimiK2.6 or GLM5.1, i.e., models within the ballpark of anything offered by frontier companies.
I run kimik2.6 and GLM5.1 on a system that cost less than $10,000. Granted, I started putting my system together 2 years ago when things were much cheaper. I run DeepseekV4Flash with 1 million context locally.
Yes it can, but the experience is not great.

A single maxed-out M3 can run a Q2 Kimi 2.6, though that's with heavily degraded perplexity.

2x M3s with RDMA can run a lossless Kimi2.6 at Q4, but with CPU only you would get okayish decode but horrible (+1m) TTFT, which wouldn't be a great _interactive_ experience.
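Back-of-envelope version of the Q2 vs Q4 sizing above, as a rough sketch. The 1-trillion-parameter total is a hypothetical stand-in, not a confirmed figure for any model named here, and the 90% usable-memory fraction is also an assumption. Real quant formats usually spend more than their nominal bits per weight, and the KV cache comes on top, so treat the results as a lower bound.

```python
import math

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8

def machines_needed(params_billion: float, bits_per_weight: float,
                    unified_memory_gb: float = 512,
                    usable_fraction: float = 0.9) -> int:
    """How many machines are needed to hold the weights.

    usable_fraction is an assumed share of memory left after the OS and
    runtime; KV cache and activations are NOT counted here.
    """
    usable = unified_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, bits_per_weight) / usable)

# Hypothetical 1T-parameter model at nominal 2, 4 and 16 bits per weight.
for bits in (2, 4, 16):
    gb = weight_memory_gb(1000, bits)
    print(f"{bits:>2} bits/weight: {gb:.0f} GB weights, "
          f"{machines_needed(1000, bits)} x 512GB machine(s)")
```

Under those assumptions the weights fit on one 512GB machine at 2 bits and need two at 4 bits, which lines up with the one-box Q2 and two-box Q4 setups described above.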

They all still fall short of Opus 4.6, though. They are good but fail on extremely complex tasks, in contrast with a frontier model that will keep on trying until it succeeds or exhausts the solution space.
Not by much, and moving goalposts makes for a bad comparison. Local open weight models are already more powerful than frontier models from only a year back.

If you believe what you read here, the gap is closing fast.

Frontier models don't keep trying until they succeed. That's a harness problem, and best believe it, the best harnesses are private and not public.
It is much more of a context window size and model capabilities problem. Local models are not even remotely close at solving complex problems, even when used with the same harness.
Won’t these H100s drop in price in a few years? With the data center build out surely these will become 1/10th the price and you’ll be able to set up a local LLM as good as opus 4.7. Even if the frontier models become more advanced, and memory hungry, you could use the same power usage as your oven to run a current day frontier model as needed? If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.
> Won’t these H100s drop in price in a few years

Doubtful. The increase in demand is greatly outpacing supply, and all signs point to a continued acceleration in demand.

> If I could drop $10,000 to have an effectively permanent opus 4.7 subscription today, I would.

lol well obviously, but realistically that price point is going to be closer to $100k, with a perpetual $1k a month in power costs.

We tend to overestimate the short-term change, while underestimating the long term impact. A lot of hot air will likely vent when businesses realize LLMs didn't magically replace their workforce. Also, prices will go through the roof when energy production inevitably fails to keep up with demand for compute. Also, Moore's law more or less predicts we'll have today's technology in our phones in less than a decade.

I predict the B200 data centers we're building today will be obsolete in 3 years and we'll be using models and hardware that aren't even on a roadmap today. Likely not NVIDIA, likely not OpenAI or Anthropic. Maybe Chinese?

In the meantime, we must continue building software with the clumsy coding agents tied to cloud services as this (for now) seems to be about the only area where AI economically makes sense.

Why? These models are going to keep drastically improving and given all the new data centers token prices will probably drop a lot in the future. Seems shortsighted given the absurd timelines these things have been improving on.
Cool, thanks for the information. I guess they drive prices down by massively parallelizing requests on, say, an H100 x8 array, so the cost is spread across many users? So if I, say, wanted to use it for 8 hours a day in my theoretical world, it’d be too expensive. My work definitely wouldn’t pay $100,000 for a server farm even if it’d give an AI to all our employees; you’d have to have engineers, a colocation space, basically all the problems that companies didn’t like and went to AWS for.
Well $100k was a generous guesstimate for some time in the future where something like an Opus 4.7 is old news.

If we think about the near future, something like Kimi2.6 is within the realm of Opus 4.6 today, but requires closer to $700k in hardware to run.

Kimi 2.6 is very close to the Opus family from my experience. Also, it absolutely does not require $700k to be able to run locally in an interactive fashion. We are talking more in the range of $10k for a slow Q2 with degraded perplexity, to ~$35k for an acceptably fast 200k context Q4 (quasi lossless perplexity).
taalas!!!
opus 4.7 caliber models are trillions of params, and a single instance would likely run on multiple h200s. $100k of hardware. not coming to your laptop anytime soon.
Parameter size gets you world knowledge and better persistence of behavior as context grows. Both of those things can be engineered around to a large degree, and the latest Qwen models show that small models can be quite smart in narrow domains and short time windows.
… maybe we should just teach models how to get their world knowledge from a local Postgres connection! Then the model can be tiny, and it can query to its little heart's desire AND run on commodity hardware TODAY!
Yes and no.

The best analogy is the difference between having N senior level engineers working for you, versus having N entry level engineers.

With frontier cloud models, you can give a single invocation one task, and it can figure everything out.

With local models, you have to manage the inputs and outputs quite a bit more, but you can achieve similar results for tasks you set up harnesses for. They are not as good at finding the right answer internally from their own weights, but they are very capable of ingesting context and reformatting text. For example, local models can debug issues quite well if you give them the error and the documentation for the particular feature you are trying to implement.
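A minimal sketch of what "give them the error and documentation" can look like in a harness. Everything here is hypothetical: the function name, the section labels, and the character budget (a crude stand-in for a token budget) are illustrative, not any particular tool's API.

```python
def build_debug_prompt(error_text: str, doc_snippets: list[str],
                       max_chars: int = 12000) -> str:
    """Assemble a debugging prompt from an error and documentation snippets.

    Snippets are added in the order given until the character budget runs
    out. The error itself is never truncated, so a very long traceback can
    still push the prompt past max_chars.
    """
    header = "You are debugging a failure. Rely on the documentation provided.\n\n"
    error_block = "ERROR:\n" + error_text.strip() + "\n\n"
    footer = "TASK:\nExplain the likely cause and propose a minimal fix.\n"

    budget = max_chars - len(header) - len(error_block) - len(footer)
    kept = []
    for snippet in doc_snippets:
        piece = "DOCUMENTATION:\n" + snippet.strip() + "\n\n"
        if len(piece) > budget:
            continue  # skip what does not fit; a smaller later snippet still may
        kept.append(piece)
        budget -= len(piece)

    return header + error_block + "".join(kept) + footer

prompt = build_debug_prompt(
    "TypeError: expected str, got bytes",
    ["decode() converts bytes to str using the given encoding."],
)
print(prompt)
```

The returned string is what you would send to whatever local inference endpoint you run. The point is that the harness, not the model's weights, supplies the facts.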

It depends on what you mean by 'productive'. The article mainly seems to be about targeting consumer level hardware, such as the Neural Processing Unit you need for a 'Copilot PC'. Windows Recall is (was?) one such local AI application. If Microsoft get their way and my next PC has one, I look forward to using it for 'productive' purposes such as playing games, handling natural language stuff and leaving my GPU free for GPUing.
> You are greatly underestimating the current hardware requirements for productive local LLMs.

Fixed that for you. Right now most models produced are based on floating point maths and probabilities, which is "expensive" to do math on.

Microsoft has researched 1-bit LLMs which can run much more efficiently, and on much cheaper hardware[1].

If this research is reproducible and reusable outside their research models, this means the cost of running self-hosted LLMs will be reduced by an order of magnitude once this hits mainstream.

[1] https://github.com/microsoft/BitNet
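Rough arithmetic behind the "order of magnitude" figure, as a sketch. It assumes ternary weights in {-1, 0, +1} (the "1.58-bit" variant) compared against a 16-bit baseline, and it counts weights only: activations, KV cache, and runtime overhead are left out, so real-world savings will be smaller.

```python
import math

BITS_FP16 = 16  # assumed baseline: 16-bit floating point weights

def compression_vs_fp16(bits_per_weight: float) -> float:
    """How many times smaller the weights are than a 16-bit baseline."""
    return BITS_FP16 / bits_per_weight

# Information content of one ternary weight: log2(3) bits.
bits_ternary = math.log2(3)

print(f"ternary, ideal packing : {bits_ternary:.3f} bits/weight, "
      f"{compression_vs_fp16(bits_ternary):.1f}x smaller")
# Storing each ternary weight in a plain 2-bit field is simpler but less dense.
print(f"ternary, 2-bit packing : 2.000 bits/weight, "
      f"{compression_vs_fp16(2):.1f}x smaller")
```

So roughly 8x to 10x on weight storage, which is where "an order of magnitude" comes from.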

i would argue we don't need anything near Opus to be productive. Sonnet is plenty productive enough
I use Opus 4.6 as an example because it's the LLM that has been widely recognized by the public as being reliably capable of doing real work across many domains. However, the same logic applies to Opus 4.5 and even previous generations. These models have huge parameter counts and large context sizes, there's no training technique that can compensate for those qualities in small and quantized models.
> we don't need anything near Opus to be productive. Sonnet is plenty productive enough

For niche applications, sure. For general use, I think the tendency towards the best model being used for everything will–to the model publishers' delight–continue. It's just much easier to get a feel for Opus and then do everything with it, versus switch back and forth and keep track of how Haiku came up with novel ways to dumbfuck this Sunday evening.