by crazygringo 31 days ago
> within a few years we will be running local models as good as today’s frontier models with almost no cost burden

Based on what? The RAM requirements alone are extraordinary.

No, running large models on shared, dedicated hosted hardware at full utilization is going to be vastly more cost-efficient for the foreseeable future.

10 comments

> Based on what?

I take it you haven’t actually run any of the current gen local models?

They all fit on fairly accessible hardware, and their performance is at least on par with what I was paying for last year.

I have one of my agents running entirely from a local model running on a MBP and it has repeatedly shown it’s capable of non-trivial tasks.

Playing around with another, uncensored, local model on my 4090 desktop has me finally thinking about canceling my personal Anthropic subscription. Fully private, uncensored chat is a game changer.

For work it’s still all private models but largely because, at this stage, it’s worth paying a premium just to be sure you’re using the best and it saves the time of managing our own physical servers. But if we got news tomorrow that Anthropic and OpenAI were shutting down, a reasonable setup could be figured out pretty quickly.

What kind of useful context window are you getting on a 4090, out of curiosity?
256k tokens for both the MBP and the 4090
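
For a rough sense of why long context is expensive in memory, here is a back-of-envelope sketch of KV-cache size for a plain transformer. Every architecture number below (layer count, KV heads, head dimension, fp16 cache) is a hypothetical placeholder, not the configuration of any model named in this thread; real models often use fewer KV heads, sliding-window or hybrid attention, or quantized caches, all of which shrink the figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """KV-cache size for standard attention: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Hypothetical architecture, chosen only for illustration.
cache = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, n_tokens=256 * 1024)
cache_gib = cache / 2**30
print(f"{cache_gib:.1f} GiB of KV cache at 256k tokens")
```

Under these assumed numbers the cache alone is roughly the size of a 4090's 24 GB of VRAM, which is why cache quantization and attention variants matter so much for long-context local use.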
Can you share more about your setup please? Like which models and other specs on the machines?

Edit: ah I see the models mentioned in another comment of yours

Which models are you currently using?
Was using Gemma-4-A3b-26B for a while for chat (using llama.cpp for backend and Open Web UI for client features). I’ve been using Qwen-3.6-A3B for agents and am currently playing with one of HauHuaCS’s uncensored Qwen models for chat and really liking it.

I also have an agent using Kimi 2.6 as a backend (which is open, but not local) and for some coding tasks as well.

Local models are 6 months to 18 months behind frontier. Even if the performance of a cloud model is faster, it's clear that local is catching up.
> Local models are 6 months to 18 months behind frontier.

I wish this was true but it is not. And I am working on open source models so if anything, I would have a bias towards agreeing with you.

Frontier closed models (GPT/Claude) are pulling away from everybody else. Even Google, once the king.

Your claim is a meme coming from benchmark results and sadly a lot of models are benchmaxxed. Llama 4, and most notably the Grok 3 drama with a lot of layoffs. And Chinese big tech... well they have some cultural issues.

"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"

https://xcancel.com/N8Programs/status/2044408755790508113

---

But thank god at least we have DeepSeek. They keep releasing good models in spite of being so seriously resource constrained. Punching well above their weight. But they are not just 6 months behind, either.

I’ve worked professionally in the open model space for 3 years, and up to 2 months ago I would have agreed with you. But it’s empirically not the case today. These models (combined with a good harness) have dramatically improved in both power and performance.

Gemma 4 was a major improvement in self-hostable local models and Qwen-3.6-A3B is a beast, and runs great on an MBP (and insanely well on a 4090).

The biggest lift is combining these models with a good agent harness (personally prefer Hermes agent). But I’ve found in practice they’re really not benchmaxxing. I’ve had these agents successfully handle a few non-trivial research projects that I wouldn’t have been able to accomplish as successfully even last year.

When you add in the open-but-not local models, Kimi, GLM, Minimax, you have a lot of very nice options. For personal use anything I don’t use local models for I give to my Kimi 2.6 powered agent.

For specific use cases, absolutely, a harness and other techniques help (this is literally what I'm working on). But GP was talking about general use.

Over-promising is a very stupid thing. Nobody will value the intermediate steps. Nobody will value all the effort because they will always compare us with frontier models made with billions and we will become a running joke. So please stop.

Over-promising is what the frontier companies are doing. I'm not pretending open weight models are gonna do your homework and pay your taxes and remember your wife's bday with a super personalized gift. I'm just saying that they seem pretty good for what they are. There's no promise being made here.
Kimi k2.6 is about on par with GPT 5.2 so I’d say open weight models are about 6 months behind.
The Q4 quantization requires about 600GB of RAM without context, not exactly consumer hardware friendly.
Has Kimi found a way to vastly reduce the amount of VRAM required without running at 3 tokens per second? That’s the real concern.
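
A quick sketch of where a figure like ~600GB can come from: weight memory is roughly parameter count times bits per weight. The parameter count and effective bits per weight below are assumptions for illustration (neither is stated in this thread), and the result excludes KV cache and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed, not from the thread: ~1 trillion total parameters and
# ~4.8 effective bits per weight for a "Q4"-style quant (block scales add overhead).
ASSUMED_PARAMS = 1.0e12
ASSUMED_BITS = 4.8

estimate_gb = weight_memory_gb(ASSUMED_PARAMS, ASSUMED_BITS)
print(f"about {estimate_gb:.0f} GB for weights alone")
```

With these assumed inputs the estimate lands near the number quoted above; a different quant format or parameter count moves it proportionally.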
I said "open weight" rather than "local". I mean, local if you have $240k to drop on GPUs but you can run Kimi k2.6 on a B300 cluster for ~$50/hour too.
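
Using only the two numbers in the comment above ($240k to buy, ~$50/hour to rent), a minimal break-even sketch. It deliberately ignores power, hosting, depreciation, and resale value, so treat it as a lower bound on how long owned hardware has to stay busy.

```python
def break_even_hours(purchase_cost_usd: float, rental_usd_per_hour: float) -> float:
    """Hours of rental that cost as much as buying outright."""
    return purchase_cost_usd / rental_usd_per_hour

hours = break_even_hours(240_000, 50)   # figures quoted in the comment
days_at_full_use = hours / 24           # assumes 24/7 utilization
print(f"{hours:.0f} rental hours, or {days_at_full_use:.0f} days of continuous use")
```

At lower utilization the break-even point stretches out in proportion, which is the core of the "shared hardware at full utilization" argument upthread.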
The Chinese models should stay close on a lag. They’re doing a ton of distillation that, realistically, I’m not sure the American frontiers can stop.
US labs got tough on "adversarial" distillation [0]. I suspect that's one of several reasons why Chinese big labs are lagging again.

[0] US AI firms team up in bid to counter Chinese 'distillation' (Apr 7) https://finance.yahoo.com/sectors/technology/articles/us-ai-...

Yeah I mean the US has gotten tough on, like, foreign interference in elections and cyber security, but if you have the Chinese state behind you—which they absolutely do and as an observer, obviously, they have to—no company can stop them.

Case in point: North Korea, with far, far fewer resources.

Local models are ~18-24 months behind the frontier on approximate intelligence, and then like 36-48 months behind the frontier on inference speed for nice hardware.
You still need the hardware

I've got a 128GB strix halo staying warm at home, it has nothing on top models with big budget. It's a good supplement to low end plans for offloading grunt work / initial triage

Have you looked into DwarfStar 4?
Been away from home for nearly a month, so was mostly going off Qwen 3.5 122b-a10b (Q4?) / Qwen 3.6 35b-a3b (Q8) / Gemma4 31b (Q8)

Thanks for suggestion tho, tool by antirez is always going to pique interest, I'll check it out when I'm finally home again

Tho says Metal / CUDA, so doesn't seem friendly to Linux AMD system

His quant that fits into 128GB looks interesting for Spark DGX as well IMO.
It is not getting easier to obtain hardware that can run models which are sufficiently useful to undercut frontier models; if anything, the cost of such hardware has gone up by 25% or more just in the past 6 months.
I think hardware prices will come back down once we start seeing more efficiency improvements in models and hardware, and once more people and companies self-host models (which seems to be happening more and more these days). I think the massive infra/hardware expenditures of OpenAI and the like are going to end up unnecessary, leading to hardware price drops.
If companies decide to self-host, wouldn't that drive the demand and therefore prices up? Most companies currently do not have the needed infrastructure.
I think companies will self host (including on rented hardware) even if it's more expensive, and that, along with efficiency improvements, will drop demand for big AI. I think big AI is overspending on hardware/datacenters at the moment.
How do you know this? I'm not trying to attack your statement, I am genuinely curious how anyone knows anything about model performance outside of benchmarks that are already in the training set.
using them you kind of get a feeling for skill level and can extrapolate that better than juiced benchmarks.
> Local models are 6 months to 18 months behind frontier.

At what tps? You can run the new gemini flash or 5.3 codex spark at 1000+tps and run circles around "open" models. You can't run anything usable locally without at the very least a blackwell 6000 if not two

Sure you can run qwen 3.6 at 20tps on a mac 128gb but let's not pretend this will get you anywhere

if that's true - and in 6 or 12 months i can get what i have today, it might not be worth paying anthropic.
You can now buy 128 GB unified memory computers from AMD as commodity.

They’re still pricey, the world is still scaling up memory production, and a lot of code isn’t yet built for AMD, but we went from the Wright brothers’ first airplane to jet engines in 27 years.

I’m not sure “it’s only a few years away” but we are sure moving there fast.

> first airplane to jet engines in 27 years.

Nitpick: more like 36 years, from Wright Flyer in 1903 to Heinkel 178 in 1939. Still quite impressive.

nittier pick: They said engine, not airplane: the first jet engine ran in 1907 (pulse jet). The first turbojet engine ran in 1937.
I believe the same thing but keep repeating the question: Then what are all the datacenters for?
Non-cynically: the frontier providers have a projection for demand.

Cynically: it’s become an executive-level gpu measuring contest. If you’re not making huge commitments on data centers, you can’t be a serious player.

Realistically: It’s a mix of the two. The recent Claude caps for agentic usage suggest that demand exceeded their immediate compute supply. That they can alleviate it with additional capacity from the existing and small-ish xAI facility suggests that either demand may not be rising quite as fast as anticipated, that they’re okay in the short term until more capacity comes online, or a mix of both.

Open questions:

1. At what price point does demand fall, and are the frontier providers overall profitable before that price point?

2. At what price/performance point do on-prem local models make more sense than cloud models?

I print documents and photos at home regularly but I still contract out to dedicated print shops.

The print shop can’t replicate the practicality of local printing and I can’t replicate their scale of investment. Both coexist perfectly.

Print-outs are a physical good. Tokens aren't.
They are both fungible. You can replace one with the other.
How does that relate to my comment? I didn't say anything about the fungibility of either. Physical goods have wildly different logistical constraints compared to anything digital. This, and only this, I would argue, makes their production at home attractive to consumers. Tokens just don't have these properties.
Agents
>running large models on shared, dedicated hosted hardware at full utilization is going to be vastly more cost-efficient for the foreseeable future.

That is only true right now because hundreds of billions of dollars are being burned by these AI companies to try to win market share. If you paid what it actually cost, your comment would likely be very different.

No, it's economies of scale, and I don't understand where anyone is coming from who thinks they'll be better off buying their own hardware. Why would you get a better deal on MATMULs/watt than the cloud providers?
Within 5-10 years you're going to see a box like one of those AMD Halo nodes running homes.

They'll be controlling lights and temperature, they'll be adding calendar reminders that show up on your phone and your fridge. Your phone and devices might sync pictures and videos there instead of the large cloud providers. They'll also be a media server, able to stream and multiplex whatever content you want through the home. They'll also be a VPN endpoint, likely your home router, maybe also a wifi access point.

I think this makes quite a bit of sense. I don't think they'll be ubiquitous, but they could be.

This distributes the power demand to where local solar generation can supplement it, gives the home user a lot of control, and claims ownership of the user data from big tech.

Maybe I'm imagining things but this is what I think is coming.

It's the LLM/data heart of the home. A useful digital tool.

It's amazing to me. You say this like it isn't an absolute horror. We've really ramped up the malignant bloat of the software industry if it goes this way.

We'll have this massive machine to do "home automation", something that by all rights should be possible with less computing than is deployed in smartwatches today. Yuck...

Moving the LLM from SaaS to the home, reducing the power distribution problem, and giving people control back over their data - getting it away from Big Tech. The home controls should also be more responsive than most current modern home automation that mostly uses wireless and Bluetooth to a cloud service. These are all good things.

That's just one piece of the puzzle. If you're running the LLM there's no reason your family's mobile devices couldn't use said home LLM box to save battery life on their devices while maintaining control of their data, searches, photos, files, etc.

Umm, you can do basically all of this, today, with Home Assistant and a handful of add-on apps.

I use a local LLM with it, but you can use a hosted LLM if you like.

The core home automation stuff can run on a potato. The LLM just writes new automations when I ask it, or acts as a natural language interface.

I use a pretty small 4B parameter local LLM, on a fairly modest mini PC. It doesn't take a frontier model to do that kind of work.

I am 1000% aware of this, but I think we're going to see more packaged solutions in the hardware front.
Another victim of Goldratt's Theory of Constraints. Some things are more important to optimize for than MATMULs per Watt. What that is I leave as an exercise to the student. May you realize what it is before it is too late.
Some individuals will choose some $10,000 hardware so they can keep freedom and privacy and that's well and good, my point is just that freedom and privacy is not what wins marketshare, and hence, IMHO, local LLMs are not going to catch up and surpass frontier models like some in this thread like to claim
> freedom and privacy is not what wins marketshare

Digital sovereignty laws may mandate/remove access to LLMs of other countries on economic and national security grounds.

We don't know the parameters but it probably takes at least an H100 and possibly several to run a SOTA model. Given the pricing (25+k per H100 + hardware to run it) and power (700W per H100 + hardware to run it), I don't see how anyone except for a largish company can afford to run this.
Are you serious? It’s multiple nodes to run a frontier model (a node is 8x GPUs), and they aren’t running on H100s. You are looking at 32+ GPUs.
I was being pretty generous to the comment I was replying to. Needing 32+ H100s just strengthens my argument that people aren't going to run frontier models locally anytime soon.
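
Taking the per-GPU figures quoted upthread (25k per H100, 700W each) and the 32-GPU floor from the reply, here is a quick sketch of what that implies before counting hosts, networking, or cooling. H100 numbers are used only because they are the ones quoted; the reply notes that frontier models are not actually served on H100s, so this is a rough lower bound rather than a real bill of materials.

```python
GPU_PRICE_USD = 25_000   # per H100, as quoted in the thread
GPU_POWER_W = 700        # per H100, as quoted in the thread
N_GPUS = 32              # lower bound suggested in the reply

def cluster_capex_usd(n_gpus: int, price_usd: int) -> int:
    """GPU purchase cost only; excludes servers, networking, and facilities."""
    return n_gpus * price_usd

def cluster_power_kw(n_gpus: int, power_w: int) -> float:
    """GPU power draw only, in kilowatts."""
    return n_gpus * power_w / 1000

capex = cluster_capex_usd(N_GPUS, GPU_PRICE_USD)
power_kw = cluster_power_kw(N_GPUS, GPU_POWER_W)
energy_kwh_per_day = power_kw * 24  # assumes the GPUs run flat out all day
print(f"${capex:,} in GPUs, {power_kw:.1f} kW, {energy_kwh_per_day:.0f} kWh/day")
```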
> shared, dedicated hosted hardware at full utilization

I must say that the largest dedicated hosted hardware providers now, like Amazon or Google, to a large extent do not produce the software they are offering as a hosted solution (like Linux, Postgres, Redis, Python, Node, etc). Similarly I'm not sure if the producers of the frontier models are going to keep their lead as the service providers for the most widely used models. They would need to have quite a bit of an edge above open-weights models.

Also, models are given very sensitive data to process. For large organizations, the shared dedicated hardware may look like a few (dozens of) racks in a datacenter, rented by a particular company and not shared with any other tenants.

> The RAM requirements alone are extraordinary.

At the same time, $100 a month is A LOT of RAM.

I strongly disagree. Humans are so insanely well incentivized here with trillions in market share to make localized AI good enough and that’s the only benchmark they need.
Are they? I don't believe there's that big of a market for local AI. Most people don't care that much, and you'll most likely lose the advertising revenue.
>I don't believe there's that big of a market for local AI. Most people don't care that much,

I agree that the market for local AI is basically limited to nerds at this point, but that's because nobody's really explained why local AI is a good thing and also because the vast majority of people need the $20 paid plan at most. How much time and money would it take to get something half as good as OpenAI's products running locally?

> that's because nobody's really explained why local AI is a good thing

There are a lot of good things that need to be explained to people, but nobody ever managed to. I don't think this will be any different.

> because the vast majority of people need the $20 paid plan at most

Exactly, people are not gonna invest time and money when there's already something else that satisfies their need.

Local AI will need to be both better and more convenient in order to be adopted by the masses.

Free is always better.
Linux is free too, but it hasn't replaced Windows despite a lot of people explaining why it should be better.

Moreover, local AI is not free. You'll need proper hardware to run it, which you have to pay extra for.

It will take another [human] generation before AI is well integrated into everyone's daily lives where people will expect a local model handling things for them. I don't think the killer app has arrived yet (OC is a hint of what is to come).
An M5 macbook pro running LM studio with Gemma4 or Qwen3.6 is basically the original GPT 3/ gpt3.5 chatgpt experience. So about $3000 usd
I agree that the vast majority of punters don't care about "local AI".

However, if you can deliver 90% of the value of AI for 90% less cost, that is a really big incentive. Companies will spring up to fill that kind of gap.

Nobody can undercut the big AI players right now because they are all over-funded by VC money. Once the frontier companies try to match cost to expense, suddenly they become very, very vulnerable.

Qwen 3.6 is virtually indistinguishable from Claude on my 5090
What kind of codebase do you work on (number of lines?). How many tokens does your local context support?

Maybe your statement is true for smaller codebases and shorter conversations, but I’d be surprised if you actually achieve good results on millions of lines of code with a million token context.

Granted if your setup works well for your workload then that’s all you need.

I've been running Qwen3.6-35B-A3B on my 8GB VRAM GPU for 3 weeks now and it's been blowing my mind how good it is (coded multiple tools that I use daily, set up CI/build scripts for several projects, meaningfully contributed to a large personal project, etc).

No one can deny that right now these new compact models are not as good as frontier models but for the first time we actually have competent local-first models. If I give you a local model that runs on your current hardware and performs at 75% of the ability of a frontier private paid model, would you still pay for frontier? More importantly, would you hand control of your processes and code to them knowing enshittification and price-hikes are always lurking nearby?

For businesses, I get it you want to compete. But personally, it's over. Even if I considered for a second paying OpenAI/Claude, not gonna happen now.

Not really, I can run models on my 24GB mac.