Hacker News new | ask | show | jobs
by raphaelj 926 days ago
Do we have estimates of the energy requirements for these models?

I just did some napkin math, looks like inference on a 30B model with a GTX 4090 should get you about 30 tokens/sec [1], or 100k tokens/hour.

Considering such systems consume about 1 kW, that's about 10 kWh/1M tokens.

Based on the current cost of electricity, I don't think anyone could get below 2 ~ 4 $ per 1M token for a 30B model.

[1] https://old.reddit.com/r/LocalLLaMA/comments/13j5cxf/how_man...

8 comments

FWIW - I need to remeasure but - IIRC my system with a 4090 only uses ~500w (maybe up to 600w) during inference of LLMs, the LLMs have a lot harder time saturating the compute compared to stable diffusion I'm assuming because of the VRAM speed (and this is all on-card, nothing swapping from system memory). The 4090 itself only really used 300~400w most of the time because of this.

If you consider 600w for the entire system, that's only 6kWh/1M token, for me 6kWh @0.2USD/kWh is 1.2USD/1M tokens.

And that's without the power efficiency improvements that an H100 has over the 4090. So I think 2$/1M should be achievable once you combine the efficiencies of H100s+batching, etc. Since LLM's generally dwarf the network delay anyway, you could host in places like washington for dirt cheap prices (their residential prices are almost half of what I used for calculations)

Are you using batch size 1 with LLMs? Larger batch sizes get much higher utilization.
Well with those numbers, I pay $0.1/kWh so theoretically $0.6/1M tokens
Depends how and where you source your energy. If you invest in your own solar panels and batteries, all that energy is essentially fixed price (cost of the infrastructure) amortized over the lifetime of the setup (1-2 decades or so). Maybe you have some variable pricing on top for grid connectivity and use the grid as a fallback. But there's also the notion of selling excess energy back to the grid that offsets that.

So, 10kwh could be a lot less than what you cite. That's also how grid operators make money. They generate cheaply and sell with a nice margin. Prices are determined by the most expensive energy sources on the grid in some markets (coal, nuclear, etc.). So, that pricing doesn't reflect actual cost for renewables, which is typically a lot lower than that. Anyone consuming large amounts of energy will be looking to cut their cost. For data centers that typically means investing in energy generation, storage, and efficient hardware and cooling.

During the crypto boom there were crypto miners in China who got really cheap electricity from hydroelectric dams built in rural areas. Shipping electricity long distance is expensive (both in terms of infrastructure and losses - unless you pay even more for HVDC infrastructure), so they were able to get great prices as local consumers of "surplus" energy.

That might be a great opportunity for cheap LLMs too.

Batching changes that equation a fair bit. Also these cards will not consume full power since llm are mostly limited by memory bandwidth and the processing part will get some idle time.
Is $0.2-0.4/kWh a good estimate for price paid in a data center? That’s pretty expensive for energy, and I think vPPA prices at big data centers are much lower (I think 0.1 is a decent upper bound in the US, though I could see EU being more expensive by 2x).
The 4090 is considerably more power-hungry compared to e.g. an A100, however.
If comparing apples to apples, the 4090 needs to clock up and consume about 450 W to match the A100 at 350W. Part of that is due to being able to run larger batches on the A100, which gives it an additional performance edge, but yes in general the A100 is more power efficient.
Mistral-small explicitly has inference costs of a 12.9b, but more than that, it's probably ran with batch size of 32 or higher. They'll worry more about offsetting training costs than about this.

Here's how it works in reality:

https://docs.mystic.ai/docs/mistral-ai-7b-vllm-fast-inferenc...

Well the 4090 is certainly less efficient on this. They are using H100's or better no doubt. If they optimize for TPUs, it'll be even better.
I get 40 tok/sec on my M3 Max on various 34B models, I gather a desktop 4090 would be at least 80?