|
One of the authors here. Glad it’s on Hacker News! There are two points I personally wanted to make through this project: 1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving;
2) ML compilation (MLC) techniques, through the underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware, for quickly delivering time-to-market value, etc. So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too. |
Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space, largely because of the amazing job Nvidia pulled off with cuDNN and its conv kernels.
LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.
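To make the "GEMM + low-bit compute" point concrete, here is a rough pure-Python sketch of a single-batch matrix-vector product over 4-bit group-quantized weights. This is only an illustration of the general idea, not how MLC LLM, TVM, or any vendor library actually implements it; the function names, the symmetric [-7, 7] scheme, and the group size of 4 are all my own assumptions.

```python
import random

def quantize_int4(row, group_size=4):
    """Symmetric per-group quantization to integers in [-7, 7] (assumed scheme)."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # One float scale per group; fall back to 1.0 for an all-zero group.
        scale = max(abs(v) for v in group) / 7 or 1.0
        scales.append(scale)
        q.extend(round(v / scale) for v in group)
    return q, scales

def gemv_int4(q_rows, scale_rows, x, group_size=4):
    """Compute y = W @ x where W is stored as small ints plus per-group scales."""
    y = []
    for q, scales in zip(q_rows, scale_rows):
        acc = 0.0
        for g, scale in enumerate(scales):
            lo = g * group_size
            hi = lo + group_size
            # Multiply-accumulate on the integer weights, rescale once per group.
            partial = sum(qi * xi for qi, xi in zip(q[lo:hi], x[lo:hi]))
            acc += scale * partial
        y.append(acc)
    return y

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
x = [random.uniform(-1, 1) for _ in range(8)]

packed = [quantize_int4(row) for row in W]
q_rows = [p[0] for p in packed]
scale_rows = [p[1] for p in packed]

y_quant = gemv_int4(q_rows, scale_rows, x)
# Full-precision reference for comparison.
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
```

The whole workload is a multiply-accumulate loop plus a cheap rescale, with no convolution anywhere, which is why the quality of the GEMM/GEMV kernels and the low-bit data path matters more here than conv kernels do.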