by lhl 381 days ago
Rather than speculating, another option is to just measure things. I churned through billions of tokens for evals and synthetic data earlier this year, so I did some of that. On an H100 node, a Llama3 70B FP8 at concurrency=128 generated at about 0.4 J/token (this was estimating node power consumption and multiplying by a generous PUE, 1.2X or something like that) - that is 120X cheaper than the 48 J/token estimated to run the 175B GPT-3 on 2021-era Microsoft DC1 hardware (Li et al. 2023) and roughly 10X cheaper than the 3-4 J/token empirical measurements for running LLaMA-65B on V100/A100 HPC nodes (Samsi et al. 2023).

Anyway, 0.4 J/token at an electricity cost of 5 cents/kWh works out to about 0.56 cents/million tokens (0.4 MJ is roughly 0.11 kWh). Even at 50% utilization you're only up to 1.1 cents/M tokens. Artificial Analysis reports the current average price of Llama3.3 70B to be about $0.65/M tokens. I'd assume most of what you're paying for is the depreciation schedule of the hardware.
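For anyone who wants to check the arithmetic above, here is a minimal sketch. The function name is just for illustration, and it assumes a node that is half utilized still draws full power, which is how the 50% figure above falls out; real idle draw is lower, so treat that as an upper bound.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def cents_per_million_tokens(joules_per_token, cents_per_kwh, utilization=1.0):
    """Electricity cost in cents per million generated tokens.

    Assumption: the node draws the same power regardless of utilization,
    so the energy bill is spread over fewer tokens as utilization drops.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    kwh_per_million = joules_per_token * 1_000_000 / JOULES_PER_KWH
    return kwh_per_million * cents_per_kwh / utilization


full = cents_per_million_tokens(0.4, 5.0)
half = cents_per_million_tokens(0.4, 5.0, utilization=0.5)

print(f"{full:.2f} cents/M tokens at 100% utilization")  # 0.56
print(f"{half:.2f} cents/M tokens at 50% utilization")   # 1.11

# Electricity as a share of the ~$0.65/M (65 cents/M) market price quoted above.
print(f"electricity share of price: {full / 65.0:.1%}")
```

Either way, electricity comes out at around 1-2% of the quoted market price.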

Note that of course, modern-day 7B class models stomp on both those older models, so you could throw in another 10X lower cost if you're going to quality adjust. Also, I did minimal perf tuning - I used FP8, and W8A8-INT8 is both faster and has slightly better quality (in my functional evals). I also used -tp 8 for my system. With -tp 4 w/ model parallelism and cache-aware routing, you should also be able to increase throughput a fair amount. Also, speculative decode w/ a basic draft model would give you another boost. And this was tested at the beginning of the year, so using vLLM 0.6.x or so - the vLLM 1.0 engine is faster (better graph building, compilation, scheduling). I'd guess that if you were conscientious about optimizing, you could probably get at least another 2X perf for free with basically just "config".

2 comments

My only question about this is the concurrency: is it really easy to leverage when you need to serve clients without much latency? I don't know much about this.
Yeah, actually for my batch usage, I usually push to 256+ concurrency, but on H100s at least, currently 64-128 is about the bend of the curve for where latency starts going out of control (this depends a lot on your context length and kvcache optimizations, though).

What I do for testing is run a benchmark_serving sweep (I prefer ShareGPT for a standard set that is slightly more realistic for caching) across the desired concurrency range (eg 4-1024 or something like that) and then plot TTFT vs Total Throughput and graph Mean, P50, and P99 - this will give you a clear picture of what concurrency/throughput you can get for a given desired latency.
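Once you have the sweep results, picking the operating point is mechanical. A rough sketch, assuming you've already parsed each run into one record per concurrency level (the `SweepPoint` fields and the function name are made up for illustration, not benchmark_serving's actual output format):

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SweepPoint:
    """One benchmark run at a fixed concurrency (hypothetical record layout)."""
    concurrency: int
    throughput_tok_s: float  # total generated tokens per second
    p99_ttft_ms: float       # P99 time to first token, in milliseconds


def best_under_budget(points: Iterable[SweepPoint],
                      p99_ttft_budget_ms: float) -> Optional[SweepPoint]:
    """Return the highest-throughput point whose P99 TTFT fits the budget.

    Returns None when no concurrency level meets the latency target.
    """
    ok = [p for p in points if p.p99_ttft_ms <= p99_ttft_budget_ms]
    return max(ok, key=lambda p: p.throughput_tok_s) if ok else None
```

Feed it the points from your own sweep plus whatever P99 TTFT you can tolerate, and it returns the concurrency to run at. The plot is still worth making since it shows how sharp the bend is.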

Yes, if we discount the billion or so Facebook spent to train Llama3.
No, let's add it. The cost for an inference provider to deploy an existing, already-trained, weights-available model is $0 (or whatever you want to add for the HF download of the weights). Open weight models simply exist now. Deal with it?

If you would like to somehow add that as a line item, perhaps you should add the full embodied energy cost of Linux (please include the entire history of compute since it wouldn't exist without UNIX), or perhaps the full military industrial complex costs from the invention of the transistor? We could go further.

I love it! Can't forget the accumulated carbon costs of all the experimentation it took to master fire, ceramics, and metals smelting.