by antirez 50 days ago
A random, funny, interesting and telling data point: my MacBook M3 Max while DS4 is generating tokens at full speed peaks 50W of energy usage...

"Data centers for LLMs are technically more energy efficient per-user than self-hosting LLM models due to economies-of-scale" is a data point the internet isn't ready for.
But if you're running it on your own hardware you might only generate tokens when you have something useful to do with them, instead of every time you load a Google search results page because Google decided the future is stuffing Gemini-generated answers down your eyeballs instead of letting you read it yourself from the primary source for 0.1 watts.
Whether I'm using Google or not is completely unrelated to whether I use OpenAI (for example) API or run LLM locally
Don't worry, capitalism takes care of that.
If LLMs were a mature product then this would be true at some point. However, you could argue (and I will) that the popularization of on-device LLM inference will lead to two things:

- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and be incentivized to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)

- A larger market for on-device (and therefore open weight) LLMs will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.

I think that despite the inefficiencies, shifting the market towards local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but is still much less than what, let's say, a PS5 draws.

Also remember how AWS had the same promise and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?

There's a bunch of companies doing garage GPU datacenters now. Probably can act as a heat source during winter too if you have a heat pump.
That's an interesting idea [1], the value being that it's easier to build servers into a bunch of homes that are being built than building a datacenter. Every now and then something reminds me of "Dad's Nuke", a novel by Marc Laidlaw, about a family that has a nuclear reactor in their basement. A really bizarre, memorable satire [2].

[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...

[2] https://en.wikipedia.org/wiki/Dad%27s_Nuke

Separate to the self-host/datacentre argument, it would be interesting to see a speed/performance/watts-per-token leaderboard between leading models. Which model is the most watt-efficient?
Akbaruddin
?
I thought this is a pretty generally accepted fact?
I've seen plenty of people on HN claim that LLMs running on their phones are the obvious future in terms of not just privacy but also efficiency, i.e. better along every possible metric.

They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.

They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.

This is pretty much true for all applications.
This is neither a controversial take nor a reason to prefer third-party hosting over self-hosting, so I don't think the internet really needs to be ready for it.
Using only this dimension in a vacuum, it sounds like an easy choice, but we're extremely early in this market, and the big providers are already a mess of pricing choices, pricing changes, and sudden quota adjustments for consumers.

Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize usage, individuals and their devices don't have to.
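
To put rough numbers on the idle-versus-active point, here is a minimal sketch of daily energy for a machine that only draws full power while generating. The 50 W figure is from the original post and the idle draw sits in the 1-5 W range quoted above; the hours of use per day and the function name are assumptions made up for illustration.

```python
def daily_energy_wh(active_watts: float, active_hours: float,
                    idle_watts: float, hours_per_day: float = 24.0) -> float:
    """Watt-hours per day: full power while generating, idle power the rest of the time."""
    if not 0.0 <= active_hours <= hours_per_day:
        raise ValueError("active_hours must be between 0 and hours_per_day")
    idle_hours = hours_per_day - active_hours
    return active_watts * active_hours + idle_watts * idle_hours


# Assumed usage pattern: 2 hours of inference per day, 3 W idle otherwise.
local_mac_wh = daily_energy_wh(active_watts=50.0, active_hours=2.0, idle_watts=3.0)

# The same machine pinned at 50 W around the clock, for comparison.
always_busy_wh = daily_energy_wh(active_watts=50.0, active_hours=24.0, idle_watts=3.0)
```

This only compares one machine against itself under two usage patterns; it says nothing about energy per token, which also depends on throughput.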

A Mac is also the rest of the personal computer!

But it's simply an economic fact that EoS will be more efficient with a task that's so easy to offload somewhere else.
It's so interesting to think about how much power it takes these machines to "think". I think I had a vague notion that it was "a lot" but it's good to put a number on it.

If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus which feel maybe comparable-ish around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever I'm burning 500W in a datacenter somewhere during that time?

There isn't a relationship between parameter size and energy use like that. You could run a 280B parameter model on a Raspberry Pi with a big SSD if you were so determined. The energy use would be small, but you would be waiting a very long time for your response.

Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.

This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.

You're thinking about power use, not energy. There are systems that can more directly minimize energy per operation at the cost of high latency but they look more like TPUs than Raspberry Pi's.
Batching lowers that, since the model is read once from memory. Activation accumulation doesn't scale as nicely
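
A toy model of that amortization, assuming the cost of streaming the weights from memory is paid once per decoding step and shared by every request in the batch, while a per-request cost (KV cache reads, activations) is not shared. Both joule figures below are invented placeholders, not measurements, and the function name is hypothetical.

```python
def energy_per_token_joules(batch_size: int,
                            weight_read_joules: float,
                            per_request_joules: float) -> float:
    """Energy attributed to one request's token in a single decoding step.

    The weight read is split across the batch; the per-request part is not.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_read_joules / batch_size + per_request_joules


# Placeholder costs, chosen only to show the shape of the curve.
WEIGHT_READ_J = 2.0   # assumed cost of reading the model once per step
PER_REQUEST_J = 0.1   # assumed cost that does not amortize

single_user = energy_per_token_joules(1, WEIGHT_READ_J, PER_REQUEST_J)
batched = energy_per_token_joules(32, WEIGHT_READ_J, PER_REQUEST_J)
```

As the batch grows, the shared term shrinks toward zero and the per-request term becomes the floor, which matches the remark that the non-shared part doesn't scale as nicely.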
Energy use for any given request is going to be roughly proportional to active parameters, not total. That would be something like 13B for Flash and 49B for Pro. So you'd theoretically get something like 190W if you could keep the same prefill and decode speed as Flash, which is unlikely.
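
The arithmetic behind that estimate, as a sketch. It assumes power scales linearly with active parameters at unchanged tokens per second, which the comment itself calls unlikely; the 13B and 49B figures are the commenter's approximations and the function name is hypothetical.

```python
def scale_power_by_active_params(base_power_watts: float,
                                 base_active_billions: float,
                                 target_active_billions: float) -> float:
    """Naive linear scaling of power with active parameter count.

    Assumes the same prefill and decode speed on both models.
    """
    if base_active_billions <= 0:
        raise ValueError("base_active_billions must be positive")
    return base_power_watts * target_active_billions / base_active_billions


# 50 W on Flash (~13B active) scaled to Pro (~49B active).
pro_estimate_watts = scale_power_by_active_params(50.0, 13.0, 49.0)
print(f"{pro_estimate_watts:.0f} W")
```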
Power isn't proportional to parameters. It may be vaguely proportional to tokens/s although batching screws that up.

Claude Sonnet is probably running on an 8-GPU box that consumes 10 kW while Opus might use more like 50 kW, but that's shared by a bunch of users thanks to batching.

Not everybody might realize this, but this is a truly excellent and very impressive result. Most models on my M4 Max run at 150W consumption.
Power consumption numbers aren't useful for efficiency calculations without also considering the tokens per second for the same model and quantization.

I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.

More power consumption is usually an indicator that the hardware is being fully utilized, all things equal (comparing GPU to GPU or CPU to CPU, not apples to oranges)
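
A minimal sketch of the metric this implies: since a watt is a joule per second, dividing power by tokens per second gives joules per token. The throughput numbers are hypothetical and only illustrate the 10W-but-10X-slower case.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: (joules / second) / (tokens / second)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Hypothetical engines running the same model and quantization.
fast_engine = joules_per_token(power_watts=50.0, tokens_per_second=20.0)
slow_engine = joules_per_token(power_watts=10.0, tokens_per_second=2.0)
```

With these made-up figures the 10 W engine draws a fifth of the power but spends twice the energy per token, because it runs ten times slower.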

I'm sorry I don't understand. From the way you frame it, and the sentiment of the replies, seems like this is some scary big number. MacBook M3 Max is a beefy machine and doing inference means it's going at full send. 50W is... what tiny appliances consume. Sure it's more than reading emails but... it's still not a number to be shocked at. An on-the-go laptop has a TDP (max rated power) of 45W. Regular work laptop is 70W. Gaming laptop 230W. The servers I have in the lab on which I run benchmarks counting syscalls per second for days on end (you know, performance engineering!) are now going north of 1kW.

Washing machine 900W. Hair dryer 1500W. Pizza oven 2000W. So yeah, you say 50W, yeah sure same as video rendering or gaming I guess, yet not really an OMG-level number.

And frankly I'm not quite sure there's anything like economy of scale where it gets more efficient if you serve more users (like some sibling comments seem to imply).

Last thing, and I know many know but also many others don't or have forgotten: Watts is a rate of consumption, not an absolute amount. That is Joule, energy. So you say 50W, but what you pay for (or the planet pays, whatever) generally is the amount of energy, hence you need to say for how long that consumption was sustained. 50W over 2 hours, that's 100 Joules, the actual resource you consumed and paid for.

Power (watts) is like speed (m/s). You say 50 miles an hour, need to say how long was the drive, so we know how far you got.

50 watts over 2 hours is 100 watt hours (Wh) which is 360 kJ. A joule is a watt second. For reference, battery capacity is often measured in Wh and household electric power use in kWh.
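
The same conversion as a small sketch, using only the definitions above (1 Wh = 3600 J, since a joule is a watt second). The function names are illustrative.

```python
SECONDS_PER_HOUR = 3600.0


def energy_wh(power_watts: float, hours: float) -> float:
    """Energy in watt-hours for a constant power draw over a duration."""
    return power_watts * hours


def wh_to_joules(watt_hours: float) -> float:
    """Convert watt-hours to joules (1 W = 1 J/s)."""
    return watt_hours * SECONDS_PER_HOUR


wh = energy_wh(power_watts=50.0, hours=2.0)
kilojoules = wh_to_joules(wh) / 1000.0
kwh = wh / 1000.0
```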

Also, datacenter scale devices are almost certainly designed to minimize energy use per operation given comparable latency. You can still compete as an on prem consumer by (1) repurposing your existing hardware, which saves on high CapEx costs, (2) increasing latency, getting your answer computed in a longer time, which probably saves at least some power by design if you can leverage e.g. NPUs, or (3) running smaller or more bespoke models that aren't worthwhile for the bigger players to serve at scale.

There's also a likely gain in serving more requests in parallel, but it may have more to do with successfully amortizing memory access for model weights than any inherent increase in efficiency. Anyway, I've argued in sibling comments that you perhaps can also leverage this on consumer hardware for the special case of DeepSeek V4.

> _50 watts over 2 hours is 100 watt hours (Wh) which is 360 kJ._

Yes of course that was a brain fart of mine. Watt is Joule per second, certainly not Joule per hour. I made the point of "lecturing" readers on power v. energy since Antirez (OP) wrote _"50W of energy usage..."_ (instead of power consumption) and it's a mistake people often make. So my side point was: ok 50W but for how long.

The other thing I'm arguing is 50W is nothing to be shocked by. I would like to see an argument for the opposite. I'd like to know what's the power consumption of playing eg. Baldur's Gate for a couple hours on a gaming rig and I wager we surpass that by a margin.

Now, the data center economies of scale. You're saying they almost certainly exist. Okay whatever I don't know. Requests served in parallel. Amortizing memory access for model weights. Likely. I'm writing this with some thinly veiled dismissive attitude because I believe that it would be very useful to have hard data on whether or not serving many users v. just one user makes LLMs more efficient. It's an important point with wide ranging implications.

If there is scale, like you claim, and one day a wealthy patron gifts me a 40k USD rig where I can run a frontier LLM locally, then I'd still be making selfish use of the commons (energy, which belongs to the planet, all of us, that kinda stuff) because the efficient/responsible choice is to pool and use a cloud vendor (or pool your rig with neighbors etc).

But saying a machine can be more efficient if it serves many users sounds to me a bit like nine women making a baby in a month.

Keep in mind, I said serving many requests in parallel, not just many users. In fact it's even more efficient if you can batch the requests of a large subagent swarm in parallel since this allows for sharing a big chunk of context/KV cache not just the model weights. That's why I raised the possibility of leveraging this same efficiency with DeepSeek V4. If as a user I can get into the habit of just firing off a request to be cranked on in the background and be completed whenever, and I reach a compute-limited performance workload (just like the big inference labs that serve many users concurrently, only on a smaller scale since the overall compute bottleneck hits sooner) that's quite new wrt. local models. It used to be that we could only do that by spending huge amounts of money on very fast RAM and/or scaling out to multiple nodes.

A big cloud vendor does not face the same opportunity: they cannot leverage the repurposing of your own existing hardware. And they'll definitely want to minimize latency in order to get maximum throughput/utilization from the hardware they did buy, even at an energy cost. That's why I was careful to note latency as a possible factor before.

Ah ok, sharing context/KV cache, I can see that helping. I need to learn more about DS V4, you seem to hint it has some advantages over previous generations in this respect. I haven't followed that closely to quite catch this argument, I'll check it out.
The basic argument is that its KV cache is roughly an order of magnitude more compact than previous Chinese models, which were already very compact compared to the likes of Gemma 4 (though that example is a bit of an extreme). If you pair this with the basic facts of how to maximize LLM inference performance at scale (this was recently talked about in a video lecture on the Dwarkesh Patel YouTube podcast) the case for doing slow batched inference on prem with DeepSeek V4, perhaps even with memory offload, becomes, as I see it, quite obvious. Of course, I'd like to be proven wrong!
I think I’ve seen about 60 watt total system whenever I’ve used a local model on a MacBook Pro or a Mac Studio. Baseline for the Mac Studio is like 10 W and like 6 W for the MacBook Pro.
That a serious number? By the way, how does a hardware normie like me even measure this?
Most components have built in power measurement (although some are more accurate than others). Apps like Intel Power Gadget, Mx Power Gadget, Afterburner, Adrenalin, etc. can show power usage in real time.
equals 2 or 3 human brains in power usage. Amazing work!
True quantitatively, not qualitatively. DeepSeek V4 is not capable of doing what a human brain can do, of course, but for the tasks it can do, it can do it at a speed which is completely impossible for a human, so comparing the two requires some normalization for speed.
I'm sure a human brain, at least my present brain, is incapable of many things DeepSeek V4 can do. Qualitatively.