HN Mirror


by ml_hardware 1592 days ago

The 9x speedup is a bit inflated: it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at.

When measured at smaller, more realistic GPU counts, the speedup is somewhere between 3.5x and 6x. See the GTC Keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330

Based on hardware specs alone, I think that training transformers with FP8 on H100 systems vs. FP16 on A100 systems should only be 3-4x faster. Definitely looking forward to external benchmarks over the coming months...
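
For anyone who wants to sanity-check a spec-based estimate like this, here is a rough back-of-envelope sketch. It is not necessarily the reasoning behind the 3-4x figure above. It assumes an Amdahl-style split where some fraction of A100 step time is compute-bound (scaling with peak tensor throughput) and the rest is memory-bound (scaling with memory bandwidth). The spec numbers are approximate vendor peak figures filled in as assumptions, and `compute_fraction` and `blended_speedup` are hypothetical names chosen for illustration.

```python
def blended_speedup(compute_ratio, bandwidth_ratio, compute_fraction):
    """Amdahl-style speedup estimate.

    compute_fraction is the share of baseline (A100) step time that is
    compute-bound; the remainder is treated as memory-bandwidth-bound.
    """
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    # Normalized new step time: each portion shrinks by its own ratio.
    new_time = (compute_fraction / compute_ratio
                + (1.0 - compute_fraction) / bandwidth_ratio)
    return 1.0 / new_time


# ASSUMPTION: approximate peak dense tensor throughput, in TFLOPS.
A100_FP16_TFLOPS = 312.0
H100_FP8_TFLOPS = 1979.0

# ASSUMPTION: approximate HBM bandwidth for the SXM parts, in TB/s.
A100_BW_TBPS = 2.0
H100_BW_TBPS = 3.35

compute_ratio = H100_FP8_TFLOPS / A100_FP16_TFLOPS
bandwidth_ratio = H100_BW_TBPS / A100_BW_TBPS

# ASSUMPTION: the compute-bound share is a guess; sweep a few values.
for frac in (0.6, 0.7, 0.8):
    est = blended_speedup(compute_ratio, bandwidth_ratio, frac)
    print(f"compute-bound share {frac:.0%}: ~{est:.1f}x")
```

The peak FP8-vs-FP16 throughput ratio alone is well above 3-4x, so an estimate in that range only comes out if a meaningful part of the training step is limited by something that improved less, such as memory bandwidth. Real numbers will also depend on interconnect, kernel efficiency, and how much of the model actually runs in FP8.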