
by junrushao1994 1086 days ago

One of the authors here. Glad it’s on HackerNews!

There are two points I personally wanted to make through this project:

1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving; 2) ML compilation (MLC) techniques, through their underlying TVM Unity software stack, are the best fit for cross-hardware, generalizable performance optimizations, quick time-to-market, etc.

So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.

10 comments

htirwklj4523432 1085 days ago

The numbers look amazing.

Can you comment on how difficult it was to achieve this, and what the relative advantages are b/w cards? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space, largely because of the amazing job Nvidia pulled off with cuDNN and its conv kernels.

LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.

junrushao1994 1085 days ago

> Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards?

Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.

Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.

gsuuon 1085 days ago

Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.

Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)

junrushao1994 1085 days ago

This is amazing to hear, Steven! (Sorry, I locked myself out of Discord a couple of days ago...) I'm sure there's a bunch of features missing, like the biased sampling you mentioned, and I'm more than happy to merge PRs if you'd love to send them :)

PeterStuer 1085 days ago

Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.

One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It does not need to be like AMD/Intel in the CPU market, where a consumer will have no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on a GPU architecture is a newsworthy exception that is quickly corrected.

PeterStuer 1084 days ago

Honestly at a loss why this got downvoted.

JonChesterfield 1086 days ago

Did the ROCm 5.6 toolchain work for you out of the box? If not, what sort of hacking / hand holding did it need?

I don't know whether there's an LLM inference benchmark in the CI suite; if not, perhaps something like this should be included in it.

junrushao1994 1086 days ago

ROCm has improved a lot over the past few months, and now ROCm 5.6 seems to work out of the box by just following this tutorial: https://rocm.docs.amd.com/en/latest/deploy/linux/installer/i.... TVM Unity, the underlying compiler MLC LLM uses, seems to work out of the box too on ROCm 5.6 (this is from Bohan Hou, who set up the environment).

JonChesterfield 1086 days ago

Awesome. I'm going to paste that into the rocm dev channel. Actual positive feedback on HN, novel and delightful. Thank you for the blog post too!

fweimer 1085 days ago

https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring... suggest that Linux support is really limited at this point. Is this information inaccurate?

JonChesterfield 1085 days ago

Depends what support means to you really. The docs use support to mean things AMD tested and expect to work, modulo errata.

If you're building the stack from source or found it in a Linux repo, decent odds it'll work for you. More likely to work on gfx9 or gfx10 than the older cards. I think that's roughly the last five years.

If you use the official distribution, some parts are compiled to gpu-specific machine code and if your gpu isn't one of those, you can't use that library. I think there's a reluctance to compile the libs for GPUs that aren't in the internal CI in case they don't work.

As an anecdote, I do most development on unsupported hardware, unsupported distro and unsupported kernel, with the upstream driver, using whatever was on llvm main that morning. That mostly works despite positioning myself as most likely to run into bugs.

bavell 1085 days ago

I'm still on rocm 5.4, been working great on my 6750XT for the past few months (Arch).

kstenerud 1085 days ago

Are there any docker images containing this? I'd like to avoid getting into dependency hell with other software on my system, as happens all too often with new technologies.

scrps 1085 days ago

There are, thankfully, quite a few. I've mostly used rocm/rocm-terminal and rocm/rocm-dev.

https://hub.docker.com/u/rocm

crowwork 1086 days ago

Yes, it works out of the box, and the blog contains a prebuilt Python package that you can try out.

Const-me 1086 days ago

Have you tested Vulkan API on the 7900 XTX? Was it faster or slower than ROCm?

junrushao1994 1085 days ago

Generally speaking I expect Vulkan to be slower than ROCm given it's designed for generic gaming across GPU vendors, so the takeaway is, whenever ROCm is available and usable, we should use ROCm. And it's the same for CUDA vs Vulkan.
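The rule of thumb above (prefer the vendor-native backend, fall back to Vulkan) could be sketched roughly as below. This is only an illustration: `PREFERENCE` and `pick_backend` are hypothetical names, not part of MLC LLM or TVM, and the list of available backends is assumed to come from whatever detection the caller already has.

```python
from typing import Iterable, Optional

# Assumed ordering: vendor-native stacks first, Vulkan as the portable fallback.
PREFERENCE = ("cuda", "rocm", "vulkan")


def pick_backend(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred backend among those reported as usable."""
    usable = {name.lower() for name in available}
    for name in PREFERENCE:
        if name in usable:
            return name
    return None  # nothing usable was reported


# Example: a machine where both ROCm and Vulkan work would be steered to ROCm.
choice = pick_backend(["vulkan", "rocm"])
```

Keeping the ordering in one tuple makes it easy to change if measurements on a given card show Vulkan ahead.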

shmerl 1085 days ago

What slows it down? Shouldn't Vulkan expose compute queues of the GPUs as well?

Const-me 1085 days ago

I don't have any expectations, but there're reasons for Vulkan to be faster.

It's a mature technology used by millions of people every day.

Unlike GPGPU compute, for videogames performance directly affects usability.

For these reasons, the software on all levels of the stack might be more optimized.

KingOfCoders 1085 days ago

Can I use two at the same time? Two 7900 XTX would be the price of 1 4090 but with much higher performance (260tok/sec)

sullx 1085 days ago

This is coming! Myself and others at OctoML and in the TVM community are actively working on multi-gpu support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:

- Support in TVM’s graph IR (Relax): https://github.com/apache/tvm/pull/15447

- Support in TVM’s loop IR (TensorIR): https://github.com/apache/tvm/pull/14862

- Distributed dialect of TVM’s graph IR for multi-node (GSPMD-type): https://github.com/apache/tvm/pull/15289

The first target will be LLMs on multiple NVIDIA GPUs, but as with the rest of the MLC-LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.

3abiton 1085 days ago

This is exciting, but it is still very apparent that more time is needed.

KingOfCoders 1085 days ago

<3

tails4e 1086 days ago

When you say best performance on nvidia, do you mean against any other method of running this model on an nvidia card?

brucethemoose2 1085 days ago

I can confirm this, mlc is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an IGP) like Llama.cpp yet (unless it's new and I missed it).

junrushao1994 1085 days ago

True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we wanted to develop a code path that absorbs any quantization formats, for example, those from GGML or GPTQ, so that they could be all used. ML compilation (MLC) is agnostic to any quantization formats, but we just haven't exposed such abstractions yet.
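For readers unfamiliar with what a "quantization format" pins down, here is a minimal sketch of generic group-wise 4-bit quantization in plain Python. It is not the GGML, GPTQ, or MLC LLM format; the group size, the min/scale layout, and the names `quantize` and `dequantize` are assumptions chosen purely for illustration.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give integer codes 0..15

Group = Tuple[List[int], float, float]  # (codes, scale, minimum)


def quantize(weights: List[float], group_size: int = 4) -> List[Group]:
    """Quantize a flat weight list, group by group, to 4-bit codes."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        # A constant group would give a zero scale, so fall back to 1.0.
        scale = (hi - lo) / LEVELS if hi > lo else 1.0
        codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize(groups: List[Group]) -> List[float]:
    """Map the 4-bit codes back to approximate float weights."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(lo + c * scale for c in codes)
    return restored


weights = [0.0, 0.5, 1.0, 1.5, -2.0, -1.0, 0.0, 1.0]
restored = dequantize(quantize(weights))
```

Formats differ mainly in these choices (group size, symmetric vs. min/scale, how codes are packed), which is the part a format-agnostic code path would have to abstract over.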

On CPU offloading: if you are writing PyTorch, it is as simple as a one-liner `some_tensor.cpu()` to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.

junrushao1994 1086 days ago

Yeah, we tried out popular solutions like exllama and llama.cpp, among others, that support inference of 4-bit quantized models.

bravura 1085 days ago

Thanks! Just curious why there is no "team" or "about us" page? It's nice sharing credit, but it also is a little unsettling when blog posts do not name contributors.

Good work though. And you have an active community on GitHub, congratulations.

junrushao1994 1085 days ago

Well, I'm very much into true open source, and my belief is that any contributor is automatically part of the team :)

azeirah 1085 days ago

I know plenty of open-source projects that list and thank every individual contributor. The website could do that too!

junrushao1994 1085 days ago

That's a great idea! We should dig around and see if there's any plugin to use

postmeta 1085 days ago

is this similar to the mosaicml amd MI250 vs nvidia A100 results but with consumer grade hardware? https://www.mosaicml.com/blog/amd-mi250

might be interesting to team up

melony 1086 days ago

Does it work with WSL2?

junrushao1994 1086 days ago

Really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine, so we could not verify it ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.

crowwork 1085 days ago

You can also try out the vulkan backend, which we know should work for windows, although speed might be slower than rocm

gsuuon 1085 days ago

FWIW I did get the CUDA backend running via WSL2