by alecco
861 days ago
Most workloads have not yet caught up with Nvidia Hopper optimizations. The key is the Tensor Cores. Google came up with the TPU (2015) for GEMM. Nvidia just took the idea and ran with it (Volta, 2017). So it wasn't that Nvidia had a head start on this. Now Nvidia Hopper is far ahead of everybody else. It has things like asynchronous memory copies for the Tensor Cores (Tensor Memory Accelerator), mixed precision, and even FP8 support. Most of the software out there has not yet caught up with that. Even Nvidia's own Transformer Engine software is not making the best use of it (Microsoft Research, October 2023: backward pass and cross-device communication). Last year FlashAttention was a game changer for performance by optimizing memory loads. Nobody was optimizing properly for Nvidia in Transformer models.