Hacker News
by MisterKent 6 hours ago
Apple is actually interesting. They are one of the few companies with a chip / PC play with real power AND basically no play in the hyperscaler market.

That means they're actually incentivized, at least short term, to benefit from PCs becoming strong enough to do local LLMs. Which makes this play make even more sense. Though, I've been saying for a while that the local AI inflection point is the death knell for these frontier labs.

7 comments

> Though, I've been saying for a while that the local AI inflection point is the death knell for these frontier labs.

"Death knell" is a touch hyperbolic. Hardware that can only run quantized models that take up GBs in VRAM falls short of even an A100 (by almost an order of magnitude[0]), which in turn falls short of what an 8xH100 cluster can do (also by another order of magnitude[0]).

I'm an avid believer in local LLMs, but I cannot deceive myself - data center accelerators will win on power dissipation numbers alone[1], even when giving generous allowances for higher efficiency on Apple chips - and assuming the Apple-efficiency advantage persists on the same TSMC process node.

0. Based on my unscientific fine-tuning training experiments across local and rented GPUs. YMMV for inference.

1. Unless Apple surprises everyone and brings back the Xserve with M7, laptop and desktop form factors simply can't dump heat fast enough to compete head-to-head, and will be designed for lower input wattage.

Doesn’t need to be a winner head to head. If it can do 90% of the tasks the big boys do, at 50% speed, for virtually no extra overhead cost save for the power consumed by a prompt - that’s gonna work for a lot of people. And that’s also basically where we’re at today. Qwen3.6 35b running quantized on 10 year old hardware solves basically all of my use cases for agents except for coding.

The frontier models are faster, and better at coding, but not so much that I’ll pay $200/month for them.

Consider this. One of the smallest Qwen models (4B parameters) powers my home automation voice assistant, and runs on CPU alone at >20 tok/s. It is enough for that use case, and could be made even better/faster with a modest GPU. It isn't as smart as some cloud-connected thingamajig, but I would never allow a literal Google or Amazon bug in my home. Huge SOTA models aren't relevant everywhere. Most people use LLMs for rather trivial tasks such as finding typos or drafting text.
But with Apple's AFM 3 architecture, we might end up with huge SOTA-adjacent models on devices with limited RAM.

They use a technique where you only load between 1B and 4B of a 20B dense model for an entire prompt run, not token by token like a MoE, and use mostly the low power ANE instead of GPU cores.

Now, imagine if/when they scale up to 100B or more? On a chip using 2W?

I think we're also ignoring a potential innovative move in how models work.

If someone could splinter or fragment the models into more specific tasks, e.g. "spellchecker AI", and get these working as well as Sonnet 4.6-4.8 on those tasks on a personal laptop, you would then question the $100 a month fee.

Bear in mind these laptops are likely to be $5000 or so because of the memory, HDD and M7 chip they likely need.

It feels to me like the beginning of the inflection point but software updates not hardware updates will be the accelerant.

Curious, what exactly does it do for you? I've had bad luck getting these small models to do anything useful tbh.
> If it can do 90% of the tasks the big boys do, at 50% speed

I want to live in this world too, but these numbers, as of today, are very aspirational and far removed from reality.

I'm no tokenmaxxer; I find my modest local setup useful, but I also know the limitations: it's slow and it sucks (relatively) at high-level and/or long-context planning compared to frontier models. Only a minority of my prompts are max-effort - it's not all I do, but it also means frontier labs aren't dying any time soon.

Consider also that right now LLMs run slowly enough you can watch them think. I've seen a demo of an LLM running at an absurdly high speed and it reminds me of when I moved from a 2400 baud modem to a 14.4 - BBS screens that I could watch draw were all of a sudden nigh-interactive. Faster-than-realtime video generation is also coming, and will also continue to require huge hardware for a long while yet.

I love local models - I have a machine at home that runs a few for me and it's a lot of fun - but for the time being they are not super trustworthy on tool calls and staying on script. Another year or so might change all that!

If anyone wishes to see the future, a fast LLM is quite eye-opening. I think chatjimmy uses Taalas' chips, where models are hardcoded into the silicon.

https://chatjimmy.ai/

Thanks, I didn't know that one! Very impressive speed although quality seems very bad
What does your local setup look like?
I’m sure you’re right, for the things you are asking of an llm, just as I am right about the things I am asking of an llm.

The real question is, what are 90% of people going to ask llms to do. I’d argue mostly it’s going to be stuff that works-now or almost-works on local models, but that’s just an opinion. It also depends on the frontier models hitting a wall of steeply diminishing returns, since they set the expectations for all of this stuff - my gut says that’s happened already they just won’t admit it for a while - but we’ll see.

This is what makes sense for me as well. All I need a local model for is playing with simple graphics: no gradients, at most ten colours which I can push through VTracer to get an SVG. Draw Things does the job, usually in 120 seconds or less.

Sometimes, I need a quick throwaway bit of python. That can take 30 minutes of my time.

The established AI players have no financial interest in making LLMs available locally. They aren't hardware companies, and if running LLMs requires paying them to host the models as well then they can naturally capture more of the value chain = more revenue.

Apple is the only player here where it would play into their natural hardware incentive to get you to pay more for better hardware. It would make sense for them to find a way to run LLMs locally (e.g., newer architectures that others here have pointed out).

Interesting times.

Is it hyperbolic though? One of the best things about the compute and memory shortage is that people are going to insane lengths to optimize things to run on lower memory / lower compute devices. If we keep this up for a while and then ramp up memory and local compute production, that AI inflection point may actually come.

Of course, these are a lot of ifs.

If we advance just 2x in hardware plus 2x in software, all coding can be done on local hardware imho.
That’s about 4 years in hardware cadence alone. There is a lot of room to improve memory bandwidth, and performance is a given with every process node. IBM showed yesterday that they can do limited runs on 0.7nm (density equivalent).
The thing is, with the level of hard investment AI vendors have, even a small reduction of their addressable market is significant. They aren’t profitable, and inference is getting commoditized fast, so even if they eventually become profitable (not via financial engineering) they won’t be able to have good margin. The pressure of both open models AND local models is pretty bad imho
The big question for local LLMs is whether there is a 100 tok/s model which requires less than 16 GB of memory and is competitive on most tasks with the cloud models.

There is some signal that this is possible through both hardware innovation and training/data improvements.

Cloud models have their own constraints - I can’t have opus4.8 spend 4 hours on a deep research question I had in the shower without spending money. I can’t do real time video game upscaling and graphics work in the cloud, period.

A laptop is about an order of magnitude cheaper than a cloud server thanks to economies of scale, uptime requirements, and other factors.

if you do the electricity math you'll see that you pay more on local models while getting less (local is more heavily quantized) compared with OpenRouter.

I'm not talking local Gemma/Qwen vs cloud Opus, but against OpenRouter same Gemma/Qwen

there are reasons to run local - privacy, availability, but cost is not one of them
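For anyone who wants to actually do the electricity math, here is a minimal sketch. The wattage, token rate and electricity price below are placeholder assumptions, not measurements from this thread, and the function name is made up for illustration. Plug in your own numbers and compare the result against whatever your provider charges per million tokens.

```python
def electricity_cost_per_million_tokens(watts, tokens_per_second, usd_per_kwh):
    """Electricity cost in USD to generate one million tokens locally.

    Only counts power drawn while generating; ignores hardware cost and idle draw.
    """
    joules_per_token = watts / tokens_per_second            # W = J/s, so J per token
    kwh_per_million = joules_per_token * 1_000_000 / 3_600_000  # 1 kWh = 3.6 MJ
    return kwh_per_million * usd_per_kwh


# Hypothetical inputs (assumptions): a 300 W box, 20 tok/s, $0.30 per kWh.
local_cost = electricity_cost_per_million_tokens(
    watts=300, tokens_per_second=20, usd_per_kwh=0.30
)
print(f"local electricity: ${local_cost:.2f} per million output tokens")
```

Whether this comes out above or below a hosted price depends entirely on the three inputs, which is why the two sides of this subthread can both be right for their own setups.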

That's assuming consumption pricing remains as-is.

There has been a lot of market-subsidy in AI which is starting to fade away: e.g. the copilot quotas/pricing. When VC switches from investing to wanting a return, the price equation is likely to change.

There is no subsidy on most OpenRouter providers, they are profitable today.

You buy a big GPU, you serve LLMs, you print money.

> The big question for local LLMs is whether there is a 100 tok/s model which requires less than 16 GB of memory and is competitive on most tasks with the cloud models.

Benchmarks maybe? Real world, no.

You just need the context otherwise. There's no way around it.

I'm not paying for a super computer to do my taxes if a cheap pc can do it for free.

So yeah, commercially it might be a death knell. Yes there's still a market for super computers, but would you rather own Apple or Cray?

> would you rather own Apple or Cray?

I would consider an HPE tower server with a processor in the same league as an M6 or M7 under the Cray brand.

We'll likely see a transformation in how frontier models are trained as a result of a push towards local inference. While it seems unlikely now, given current pricing for RAM, in 10-15 years it's not unthinkable to assume we could see individual machines with 10-12TB (and well beyond that) of RAM which are accessible to the GPU. Min/max system RAM increased a LOT from 2010-2025 and largely because it was cheap. Once the hyperscalers aren't generating revenue for the RAM manufacturers, I wouldn't be surprised to see a massive push towards consumers in order to maintain gross profit. Not to mention new players who enter the market because the margins are measurably absurd right now.

At some point there will be diminishing returns towards the "just throw more RAM at it" approach the current frontier models are taking. Commoditization is just as inevitable as it ever was... and in doing so will enable actual leaps of what AI/ML is capable of. That's not to say there won't be a place for 99.999999% accurate vs 99.99999% but those cases will be limited and likely primed for disruption based on real innovation vs access to capital.

The 1080 Ti has been out there for almost 10 years now. It has 11GB of VRAM. A 5090 has 32GB.

SOCs with unified memory have shifted this a bit forward, but they're also expensive as shit.

10TB ram in a consumer device is simply not happening in the next 10 years.

Half a year ago you could get an AI max 395+ with 128GB ram in mobile form factor for ~$2200. The same thing now costs $3700. Same SoC, same memory.
10TB is about 80 times that, 200K in today’s money. A lot of capacity is coming online in the next 5 years and it’s reasonable to think we can get there with better process and stacking (the latter does little for pricing, but enables shorter latencies).
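A quick sketch of that scaling arithmetic, using only the figures quoted above (128GB, ~$2200 then, $3700 now). It assumes 10TB means 10 x 1024 GB and that cost scales linearly with capacity, which is a deliberately naive assumption since those prices are for a whole machine, not just the memory.

```python
BASE_RAM_GB = 128            # the 128GB machine from the comment
TARGET_RAM_GB = 10 * 1024    # 10TB, assuming binary units

scale = TARGET_RAM_GB / BASE_RAM_GB   # how many 128GB machines' worth of RAM

# Prices quoted in the comment for the same machine at two points in time.
prices_usd = {"half a year ago": 2200, "now": 3700}

# Naive linear extrapolation: scale the whole machine price by the RAM ratio.
naive_cost = {when: scale * price for when, price in prices_usd.items()}

for when, cost in naive_cost.items():
    print(f"{when}: {scale:.0f}x -> ${cost:,.0f}")
```

The rough 200K figure sits near the low end of that range, i.e. closer to the earlier price than the current one.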
I agree with the general direction but I'm a little skeptical of the "just add a few more TB of RAM and the frontier moves local" version of it
I think this is right but it also depends on what "compete" means
Indeed. Local models becoming available and halfway decent doesn't obviate the laws of scale. And because there's no ceiling to what scaling more will buy you in terms of capability, there's no reason not to scale more, and there's no incentive for billionaires not to grab all the fab capacity they can.

Enjoy paying $1000 or more for a little 4 GiB cloud terminal that connects you to all your online accounts where all your actual work gets done. This is the future.

>there's no ceiling to what scaling more will buy you in terms of capability

This is highly doubtful.

Rule of thumb: everything people think is exponential is actually an S curve.

better rule: exponentials are overlapping S-curves
There's a limit that won't be breached without a fundamental breakthrough in physics of computation, but we're not there yet by a long shot. You can train bigger models, faster, and infer with them faster and more precisely, by throwing more compute at the problem for the foreseeable.
At some point, and I can already see it, they’ll be better than us at writing code. We are still in the loop to coerce them into architecting well, but that’s nothing magical.

What’s frontier now is prosumer in a couple years and commonplace in a couple more.

Indeed. If Apple makes it feasible to run models like GLM 5.2 at home, I will become their customer.
It's plausible but is the Apple Tax for a 1TB memory machine on top of current memory prices really worth it? I paid around $4000 for a 4090m laptop with 16GB VRAM back in 2023, it's great but DoA for even quantized LLMs. I can run SLMs and fine-tune them but that's it.

We need one of those specialized inference chip startups to succeed and a PC manufacturer willing to bet on them against Nvidia for the local AI to find mass market appeal.

I recently bought a Mac mini M4 16 GB - mostly to run Immich. I assumed I needed a Linux box. After a lot of research I was quite surprised that the mac was the cheapest option. So not always an Apple tax.
> After a lot of research I was quite surprised that the mac was the cheapest option

For Immich, the cheapest option will either be a NAS or a used laptop depending on the amount of data you need, I wouldn't buy a mac for that.

Maybe he wants really fast or large AI models inside immich?

(I just run the defaults on my CPU, works for me)

I'm not sure why you would need large AI models for Immich, the face detection is pretty cheap and will run on 10 year old hardware without a blip.

I think the decision comes down primarily to how much data you would like to store for Immich. If you want to go cheaply, a 100-buck used laptop will do the job; if you have too much data, a NAS will be more suitable (and you are certainly not going to get a mac where you can plug multiple internal hard drives for the price of a NAS)

> "After a lot of research I was quite surprised that the mac was the cheapest option. So not always an Apple tax."

Apple has always been the most cost effective choice for the value you get going all the way back to the Apple II, it's just that the floor of that cost has always been high. Anyone who thinks otherwise is just a fanboy one way or the other.

That's true only for the entry level macs. My M4 Mac Mini has the best Performance/value. But my workstation laptop with 32 cores, 96GB DDR5, Nvidia GPU costs less than Macs with lower performance; not to mention I upgraded the RAM post purchase.
RAM upgrade possibility has a downside though - very low RAM bandwidth, which is highly relevant today if you want to run local LLMs
Hmmm, not always. Between at least 1998 and 2005, PCs were just better. Better CPUs.
If you think the apple tax is high you should see the nvidia tax.
> I paid around $4000 for a 4090m laptop

That's how much many developers currently spend on tokens - every day. Whatever "Apple Tax" applies to a device that can run a capable model offline will amortise itself in a blink.

>Whatever "Apple Tax" applies to a device that can run a capable model offline will amortise itself in a blink.

Current high-end Mac Studio with 32-core M3 Ultra chip and 96 GB of memory is $6800, 96GB is not enough to run GLM 5.2 without extreme quantization or stacking HW; but for the sake of discussion let's run quantized version on a single high end Mac Studio.

GLM 5.2 Max plan costs $112/month, so it would take ~60 months to recover the costs, assuming the machine was bought just for AI. By then the current AI landscape would have changed drastically.

I use local AI on both Linux and Mac every single day, there's freedom, privacy and peace of mind in running the model locally. But I feel cost/value of local AI is overblown.
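The break-even math above is easy to sketch, and it shows why this subthread disagrees: the answer swings by orders of magnitude depending on what spend you assume the machine replaces. The $6800 and $112/month figures are from the comment above and the $4000/day figure is the claim from upthread; the function name is hypothetical, and electricity, resale value and quality differences are all ignored.

```python
def payback_months(hardware_usd, replaced_spend_usd_per_month):
    """Months until the hardware price equals the cloud spend it replaces."""
    return hardware_usd / replaced_spend_usd_per_month


MAC_STUDIO_USD = 6800          # price quoted in the comment above
SUBSCRIPTION_USD_MONTH = 112   # Max plan price quoted in the comment above
HEAVY_SPEND_USD_DAY = 4000     # the "$4000 every day" claim from upthread

# Scenario 1: the machine replaces a flat monthly subscription.
months_vs_subscription = payback_months(MAC_STUDIO_USD, SUBSCRIPTION_USD_MONTH)

# Scenario 2: the machine replaces heavy per-token spending (30-day month).
heavy_spend_usd_month = HEAVY_SPEND_USD_DAY * 30
days_vs_heavy_spend = MAC_STUDIO_USD / HEAVY_SPEND_USD_DAY

print(f"vs subscription: {months_vs_subscription:.1f} months")
print(f"vs ${heavy_spend_usd_month:,}/month of tokens: {days_vs_heavy_spend:.1f} days")
```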

In what sustainable world outside of Bay Area jobs do devs spend 120k on tokens monthly?
Nobody said anything about sustainable, or outside of the Bay Area really
you're not a customer of any of their products at all already? not a single apple device in your household?
I didn't have a single Apple device in my house until a month ago when I bought a Neo. The last Apple devices I had before that were an iPod Nano and a PowerMac G5 many many years ago.

Apple has pretty good competition in every segment with the exception of maybe the iPad, but I'm not a tablet user.

None. And I have a PC, a personal laptop, a work laptop, my current and my previous Android phone.
Some folks like to have a computing environment free of proprietary influences and extremely strong vendor lock-in. I cannot claim to possess any Apple devices.
How does the Mac have extremely strong vendor lock-in?

Sure, you can use the App Store and use all the stuff that integrates with iPhone, iCloud, etc

But you can also just treat it as Linux for Laptops (that actually works), and roll with all the standard open source tools.

I don't disagree with you, but technically speaking MacOS is still proprietary and Asahi is not compatible with the latest and greatest Apple devices.

While they don't _prevent_ Asahi from doing what they're doing, they certainly don't go out of their way to make it easy for them.

I wasn't thinking of Asahi. Just pointing out that you can run all the standard unix/open source tools and apps on Mac OS (vi, git, qgis, blender, vsc, python, node, etc). With the advantage of higher quality hardware and generally less fiddling.

But if you don't like it, switch. I don't see vendor lock-in.

Apple is also iOS
Correct. Been using Linux and Android for over ten years. My household had no Apple devices until I got married.
No, I've never owned an Apple device in my life, neither has anyone in my family to my knowledge.
What a bizarre bubble you live in to even be asking this question... I've never owned a single apple product, and never will.

And on the rare occasions when I have to use someone's MacBook, I'm completely lost - like some elderly person.

there are many people who don't own Apple. Why are you so surprised? I certainly don't and never will. What's it got that I can't get on a standard PC + Linux?
Not. A. Single. One.
They do stand in front of a great opportunity that would also benefit consumers, which seems rare in the llm era.

If people can get opus4.6/gpt5.5-like models locally, labs could raise their prices and sell token speed, better reasoning, mobile-focused improvements, you name it.

Not all consumers are power users and many will be happy to pay for flexibility.

Most people don't actually want to manage models, updates, context limits, quantization, etc. They just want the thing to work everywhere
Once one person figures that out and writes a blog post, everybody else can do it.
Yes, just like 90% of regular users set up NASes instead of just using Dropbox or Google Drive.

https://xkcd.com/2501/

Tangential: About 8 years ago ex-Apple chip engineers left to design server-grade chips, this was Nuvia, and they got sued by Apple to the point that they had to get acquired by Qualcomm.
Then, after getting acquired by Qualcomm they got sued by ARM.

So maybe they were assholes.

I worked at a hyperscaler when the M1 came out. A MacBook Air M1, running a Linux VM was faster and more energy efficient than anything we had in the data center.
I'm not sure it's a death knell for frontier labs so much as a narrowing of what people need them for
When you've raised hundreds of billions in funding, every result except "to the moon" is a death knell.
I really wish people stopped saying things like "I've been saying that"

why not just say "I think that"

do you see yourself as some kind of visionary about this particular topic? literally EVERYONE is saying that, it's the most obvious fact about AI