Scaling LLama2-70B with Multiple Nvidia/AMD GPU


	Scaling LLama2-70B with Multiple Nvidia/AMD GPU (blog.mlc.ai)
	13 points by junrushao1994 975 days ago

5 comments

junrushao1994 975 days ago

Machine Learning Compilation (MLC) now supports compiling LLMs to multiple GPUs.

It runs 4-bit quantized Llama2-70B at:

- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k

- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k

- It also scales well with 8 A10G/A100 GPUs in our experiments.
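As a rough way to compare the two consumer setups, throughput per dollar can be computed directly from the quoted figures. This is only a sketch: it assumes the $3k and $2k prices are totals for each two-GPU pair, and the function and variable names are illustrative.

```python
def tok_per_sec_per_kusd(tok_per_sec: float, price_kusd: float) -> float:
    """Decode throughput divided by hardware price in thousands of USD."""
    return tok_per_sec / price_kusd

# (tok/sec, price in $k) as quoted in the post; prices assumed to cover both GPUs.
setups = {
    "2x RTX 4090": (34.5, 3.0),
    "2x Radeon 7900 XTX": (29.9, 2.0),
}

results = {name: tok_per_sec_per_kusd(*spec) for name, spec in setups.items()}

for name, value in results.items():
    print(f"{name}: {value:.2f} tok/sec per $1k")
```

Under that pricing assumption, the 7900 XTX pair delivers more tokens per second per dollar even though the 4090 pair is faster in absolute terms.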

Details:

- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...

- Project: https://github.com/mlc-ai/mlc-llm


brucethemoose2 975 days ago

For those suffering from deceptive graph fatigue, this is impressive.

exLlama is blazing fast. Even if they just benched exllamav1, exllamav2 is only a bit faster, at least on my single 3090 in a similar environment.

vLLM is focused more on batching performance, but even then MLC/TVM looks like it's putting up a fight without batching.

I am a bit fatigued with llama backends myself, and it looks like this won't help me run 70B in a single 3090, but I need to dig into mlc again.
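For context on why 70B does not fit on a single 3090, a back-of-the-envelope estimate of the weight footprint is enough. This sketch counts weights only; it ignores the KV cache, activations, and any quantization metadata, all of which add to the real requirement. The helper name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

RTX_3090_VRAM_GB = 24  # VRAM on a single RTX 3090

llama2_70b_gb = weight_footprint_gb(70e9, 4)  # 4-bit quantized weights
fits_on_one_3090 = llama2_70b_gb <= RTX_3090_VRAM_GB

print(f"4-bit Llama2-70B weights: ~{llama2_70b_gb:.0f} GB")
print(f"Fits on one 3090: {fits_on_one_3090}")
```

At roughly 35 GB for the weights alone, the model needs at least two 24 GB cards, which matches the two-GPU setups in the post.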


junrushao1994 975 days ago

Yeah thanks for sharing! This is definitely super valuable data and insights :)

Regarding exllama-V2, MLC/TVM does benchmark against it:

- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...

- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...

> vLLM focuses more on batching performance

Exactly. vLLM doesn’t optimize for latency-first scenarios, as it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.

Regarding batching, it is coming pretty soon, and we will have another blog post on this.


l3jin 975 days ago

Universal deployment is indeed attractive. I have tested Llama2-70B on the 7900 XTX. Love the performance!

Also saw a report earlier today on MLC’s Discord about the AMD MI-100:

GPU Count | Model Size | Prefill Speed | Decode Speed
1         | 33b        | 102.2         | 22.3
2         | 33b        | 112.3         | 33.0
4         | 33b        | 144.8         | 41.2
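To read the scaling in these numbers, one can compute the decode speedup and per-GPU efficiency relative to the single-GPU row. This is a sketch; the report gives no units, so the speeds are assumed to be in the same unit across rows (presumably tok/sec), and the variable names are illustrative.

```python
# Decode speed per GPU count, copied from the reported MI-100 numbers.
decode_speed = {1: 22.3, 2: 33.0, 4: 41.2}

baseline = decode_speed[1]

# Speedup over one GPU, and efficiency = speedup / number of GPUs.
speedup = {n: s / baseline for n, s in decode_speed.items()}
efficiency = {n: speedup[n] / n for n in decode_speed}

for n in sorted(decode_speed):
    print(f"{n} GPU(s): speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```

Decode gets about 1.5x faster on two GPUs and about 1.8x on four, so each added GPU helps less than the previous one.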


jinhongyii 975 days ago

The performance is really amazing at such a low cost.


zhye 975 days ago

Serving LLMs with AMD GPUs looks impressive. MLC is evolving fast! Any results on NVLink/xGMI instead of PCIe?
