by MaximilianEmel
859 days ago
Could it be that today's workloads are simply a perfect fit for Nvidia GPUs? Not because the chip is ideal, but because Nvidia GPUs are so widely available that current workloads are written to take advantage of their architecture.
Google came up with the TPU (2015) for GEMM. Nvidia took the idea and ran with it (Tensor Cores, first in Volta 2017, then Turing 2018). So it wasn't that Nvidia had a head start on this.
Now Nvidia Hopper is far ahead of everybody else. It has things like asynchronous memory transfers feeding the tensor cores (Tensor Memory Accelerator), mixed precision, and even FP8 support.
Most of the software out there has not yet caught up with that. Even Nvidia's own Transformer Engine software is not making the best use of it (per Microsoft Research, October 2023, on the backward pass and cross-device communication).
Last year FlashAttention was a game changer for performance, mostly by optimizing memory loads. Before that, nobody was optimizing Transformer models properly for Nvidia hardware.
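
To make the memory-load point concrete, here is a rough NumPy sketch of the tiling idea behind FlashAttention: process keys and values in blocks and keep a running softmax, so the full n x n score matrix is never materialized. This is only an illustration on the CPU; the real thing is a fused CUDA kernel that keeps the tiles in on-chip SRAM. The function names and the `block` parameter are made up for this example.

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference version: builds the full (n_q x n_k) score matrix in memory.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    # Hypothetical helper: streams over K/V in blocks with a running softmax.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query row
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale           # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale what was accumulated so far to the new running max.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((6, 8))
    k = rng.standard_normal((10, 8))
    v = rng.standard_normal((10, 5))
    print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

Both functions compute the same result; the tiled one just never holds more than one block of scores at a time, which is what lets a GPU kernel stay in fast memory instead of going back and forth to HBM.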