Exactly. vLLM doesn’t optimize for latency-first scenarios, since it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.
Regarding batching, it is coming pretty soon, and we will have another blog post on this.
For Llama2-70B, it runs the 4-bit quantized model at:
- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k
- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k
- It also scales well with 8 A10G/A100 GPUs in our experiments.
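To put the two consumer setups side by side, here is a rough sketch of throughput per $1k of hardware. It assumes the quoted prices are totals for each two-GPU pair, which the numbers above don't spell out; the dictionary keys and the helper name are just illustrative.

```python
# Figures quoted above. ASSUMPTION: each price is the total for the two-GPU setup.
setups = {
    "2x NVIDIA RTX 4090": {"tok_per_sec": 34.5, "price_usd": 3000},
    "2x AMD Radeon 7900 XTX": {"tok_per_sec": 29.9, "price_usd": 2000},
}


def tok_per_sec_per_kusd(tok_per_sec, price_usd):
    """Decoding throughput per $1,000 of GPU cost (hypothetical helper)."""
    return tok_per_sec / (price_usd / 1000)


# Compute the ratio for each setup and print it.
results = {name: tok_per_sec_per_kusd(**s) for name, s in setups.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f} tok/sec per $1k")
```

Under that pricing assumption the AMD pair comes out ahead per dollar, while the NVIDIA pair is faster in absolute terms.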
Details:
- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
- Project: https://github.com/mlc-ai/mlc-llm