by mgreg 1014 days ago

Thanks to the folks at MLCommons, we have benchmarks and data, published today, for evaluating and tracking inference performance. The results cover GPUs, TPUs, and CPUs, plus some power measurements, across several ML use cases including LLMs.

"This benchmark suite measures how fast systems can process inputs and produce results using a trained model. Below is a short summary of the current benchmarks and metrics. Please see the MLPerf Inference benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite."

https://mlcommons.org/en/inference-datacenter-31/

For example, the latest TPU (v5) from Google scores 7.13 queries per second with an LLM. Looking at GCP, that server runs $1.20 / hour on demand.

On Azure, an H100 scores 84.22 queries per second with an LLM. I couldn't find the price for that, but an A100 costs $27.197 per hour, so no doubt the H100 will be more expensive than that.

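7.13 / $1.2 = 5.94 queries/second per $/hour (TPU v5)

84.22 / $27.197 (A100 pricing) = 3.10 queries/second per $/hour (H100)

Because the A100 price stands in for the H100 price, which will be higher, the H100 figure is an upper bound.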
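For anyone who wants to redo the arithmetic with other prices, here is a minimal Python sketch. The function names are my own, not anything from MLCommons, and the H100 row assumes the A100 hourly price as a stand-in, exactly as above. It also converts to queries per dollar, assuming the system sustains the benchmark throughput for a full hour, which is an idealization.

```python
def qps_per_dollar_hour(qps: float, usd_per_hour: float) -> float:
    """Benchmark throughput (queries/second) divided by the hourly price."""
    if usd_per_hour <= 0:
        raise ValueError("hourly price must be positive")
    return qps / usd_per_hour


def queries_per_dollar(qps: float, usd_per_hour: float) -> float:
    """Queries served per dollar, assuming sustained throughput for one hour."""
    return qps_per_dollar_hour(qps, usd_per_hour) * 3600


# Figures quoted in the post. The H100 price is NOT known here:
# the A100 price is used as a lower bound, so the result is an upper bound.
tpu_v5 = qps_per_dollar_hour(7.13, 1.2)
h100_upper_bound = qps_per_dollar_hour(84.22, 27.197)

print(f"TPU v5: {tpu_v5:.2f} queries/second per $/hour")
print(f"H100 (A100 price): {h100_upper_bound:.2f} queries/second per $/hour")
print(f"TPU v5: {queries_per_dollar(7.13, 1.2):.0f} queries per dollar")
```

Swapping in a real H100 price only needs the second argument changed. The comparison still ignores batch sizes, model differences, and how many accelerators each submitted system actually contains.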

[edited to include GCP TPU v5 and Nvidia H100 relative performance info for LLM Inference]