by raphaelj
926 days ago
Do we have estimates of the energy requirements for these models? I just did some napkin math: inference on a 30B model with an RTX 4090 should get you about 30 tokens/sec [1], or roughly 100k tokens/hour. Assuming such a system draws about 1 kW, that's about 10 kWh per 1M tokens. Based on the current cost of electricity, I don't think anyone could get below $2 to $4 per 1M tokens for a 30B model. [1] https://old.reddit.com/r/LocalLLaMA/comments/13j5cxf/how_man...
|
If you assume 600 W for the entire system, that's only 6 kWh per 1M tokens. For me, 6 kWh at 0.2 USD/kWh comes to 1.2 USD per 1M tokens.
And that's without the power efficiency improvements that an H100 has over the 4090. So I think $2 per 1M tokens should be achievable once you combine the efficiencies of H100s, batching, etc. Since LLM inference time generally dwarfs the network delay anyway, you could host in places like Washington, where electricity is dirt cheap (their residential prices are almost half of what I used for my calculations).
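For anyone who wants to rerun the napkin math with their own numbers, here is a rough sketch in Python. The function name is made up for illustration, and the 0.20 USD/kWh price applied to the 1 kW case is an assumption (the parent comment gives no specific price). It uses the exact 30 tokens/sec rate rather than the rounded 100k tokens/hour, so the results come out slightly below the 10 kWh and 6 kWh figures above.

```python
def energy_and_cost_per_million_tokens(tokens_per_sec, system_watts, usd_per_kwh):
    """Return (kWh, USD) needed to generate 1M tokens.

    Assumes constant throughput and constant power draw, and counts
    electricity only (no hardware, cooling, or idle time).
    """
    hours = 1_000_000 / tokens_per_sec / 3600   # wall-clock hours for 1M tokens
    kwh = (system_watts / 1000) * hours         # energy used over that time
    return kwh, kwh * usd_per_kwh


if __name__ == "__main__":
    # (label, tokens/sec, system watts, USD per kWh)
    scenarios = [
        ("1 kW system", 30, 1000, 0.20),    # price here is an assumption
        ("600 W system", 30, 600, 0.20),
        ("600 W, half-price power", 30, 600, 0.10),
    ]
    for label, tps, watts, price in scenarios:
        kwh, usd = energy_and_cost_per_million_tokens(tps, watts, price)
        print(f"{label}: {kwh:.2f} kWh, ${usd:.2f} per 1M tokens")
```

Batching would show up here as a higher effective tokens/sec for the same power draw, which is where most of the remaining savings would likely come from.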