by ml_hardware
1544 days ago
The 9x speedup is somewhat inflated: it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at. At smaller, more realistic GPU counts, the speedup is somewhere between 3.5x and 6x. See the GTC Keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330

Based on hardware specs alone, I think training transformers with FP8 on H100 systems should be only 3-4x faster than with FP16 on A100 systems. Definitely looking forward to external benchmarks over the coming months...
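A spec-based estimate like the 3-4x figure can be sanity-checked with a rough Amdahl-style blend. This is only a minimal sketch, not the calculation behind the comment: it assumes part of each training step scales with peak tensor throughput and the rest with memory bandwidth. The function name `blended_speedup`, the compute-bound fractions, and the approximate peak figures (dense, no sparsity, as announced) are all assumptions to be checked against the vendor datasheets.

```python
# Rough Amdahl-style estimate. Every number here is an assumption,
# not a measurement and not the original commenter's calculation.

def blended_speedup(compute_ratio, bandwidth_ratio, compute_fraction):
    """Overall speedup when `compute_fraction` of the baseline step time
    scales with peak math throughput and the remainder scales with
    memory bandwidth."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    memory_fraction = 1.0 - compute_fraction
    new_time = (compute_fraction / compute_ratio
                + memory_fraction / bandwidth_ratio)
    return 1.0 / new_time

# Assumed approximate peak figures (dense tensor math, no sparsity).
A100_FP16_TFLOPS = 312.0   # assumption: A100 FP16 tensor peak
H100_FP8_TFLOPS = 2000.0   # assumption: H100 FP8 tensor peak as announced
A100_HBM_TBPS = 2.0        # assumption: A100 80GB memory bandwidth
H100_HBM_TBPS = 3.0        # assumption: H100 memory bandwidth as announced

compute_ratio = H100_FP8_TFLOPS / A100_FP16_TFLOPS
bandwidth_ratio = H100_HBM_TBPS / A100_HBM_TBPS

# Hypothetical compute-bound fractions for a transformer training step.
for compute_fraction in (0.6, 0.7, 0.8, 0.9):
    speedup = blended_speedup(compute_ratio, bandwidth_ratio, compute_fraction)
    print(f"compute-bound {compute_fraction:.0%}: ~{speedup:.1f}x")
```

Under these assumptions, compute-bound fractions of roughly 70-80% land in the 3-4x range, while a fully compute-bound step would approach the raw throughput ratio. Interconnect and scaling effects, which matter at ~8k GPUs, are deliberately left out.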