|
I'm sorry, I don't understand. From the way you frame it, and the sentiment of the replies, it seems like this is some scary big number. A MacBook M3 Max is a beefy machine, and doing inference means it's going at full send. 50W is... what tiny appliances consume. Sure, it's more than reading emails, but... it's still not a number to be shocked at. An on-the-go laptop has a TDP (max rated power) of 45W. Regular work laptop: 70W. Gaming laptop: 230W. The servers I have in the lab, on which I run benchmarks counting syscalls per second for days on end (you know, performance engineering!), are now going north of 1kW. Washing machine: 900W. Hair dryer: 1500W. Pizza oven: 2000W. So yeah, you say 50W, and sure, that's the same as video rendering or gaming I guess, yet not really an OMG-level number. And frankly I'm not quite sure there's anything like economy of scale where it gets more efficient if you serve more users (like some sibling comments seem to imply). Last thing, and I know many know but also many others don't or have forgotten: watts measure a rate of consumption (power), not an absolute amount. The absolute amount is energy, measured in joules or watt-hours. So you say 50W, but what you pay for (or the planet pays, whatever) is generally the amount of energy, hence you need to say for how long that consumption was sustained. 50W over 2 hours is 100 watt-hours (0.1 kWh, or 360,000 joules): that's the actual resource you consumed and paid for. Power (watts) is like speed (miles per hour). You say 50 miles an hour, but you need to say how long the drive was, so we know how far you got. |
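To make the power-versus-energy point concrete, here's a rough sketch of the arithmetic: energy is power multiplied by time. The function names and the electricity price are my own assumptions for illustration, not figures from anyone in this thread.

```python
# Energy = power x time. Helper names and the price are illustrative assumptions.

SECONDS_PER_HOUR = 3600.0
ASSUMED_PRICE_PER_KWH = 0.30  # hypothetical tariff, in whatever currency you like


def energy_joules(power_watts: float, hours: float) -> float:
    """Energy in joules: 1 W sustained for 1 s is 1 J."""
    return power_watts * hours * SECONDS_PER_HOUR


def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kilowatt-hours, the unit an electricity bill uses."""
    return power_watts * hours / 1000.0


def cost(power_watts: float, hours: float,
         price_per_kwh: float = ASSUMED_PRICE_PER_KWH) -> float:
    """Cost of sustaining a load, under the assumed tariff."""
    return energy_kwh(power_watts, hours) * price_per_kwh


# 50 W sustained for 2 hours
print(energy_joules(50, 2))  # joules
print(energy_kwh(50, 2))     # kilowatt-hours
print(cost(50, 2))           # cost under the assumed tariff
```

Same formula for the hair dryer: 1500W for 4 minutes is the same 100 Wh as the laptop doing inference for 2 hours.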
Also, datacenter-scale devices are almost certainly designed to minimize energy use per operation at comparable latency. You can still compete as an on-prem consumer by (1) repurposing your existing hardware, which avoids high CapEx, (2) accepting higher latency, i.e. letting your answer take longer to compute, which probably saves at least some power by design if you can leverage e.g. NPUs, or (3) running smaller or more bespoke models that aren't worthwhile for the bigger players to serve at scale.
There's also a likely gain in serving more requests in parallel, but it may have more to do with successfully amortizing memory access for model weights than any inherent increase in efficiency. Anyway, I've argued in sibling comments that you perhaps can also leverage this on consumer hardware for the special case of DeepSeek V4.
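A toy model of that amortization argument, for what it's worth: if each decode step pays a fixed energy cost to stream the weights from memory, and that cost is shared by every request in the batch, then per-token energy falls with batch size until the per-token compute cost dominates. The function name and both energy figures below are made-up placeholders, not measurements of any real hardware or model.

```python
# Toy model: the fixed cost of reading weights is shared across a batch.
# All numbers are hypothetical placeholders, not measurements.

def energy_per_token(batch_size: int,
                     weight_read_joules: float,
                     compute_joules_per_token: float) -> float:
    """Energy attributed to one generated token in a batch of `batch_size`.

    weight_read_joules: assumed energy to stream the weights once per step.
    compute_joules_per_token: assumed arithmetic energy per token, which
    does not shrink with batching.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_read_joules / batch_size + compute_joules_per_token


for b in (1, 2, 4, 8, 16):
    # Hypothetical: 2.0 J per weight pass, 0.1 J of compute per token.
    print(b, energy_per_token(b, 2.0, 0.1))
```

In this sketch the gain flattens out as the batch grows, which fits the point above: it's amortization of a fixed memory cost, not the hardware becoming inherently more efficient.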