It runs 4-bit quantized Llama2-70B at:
- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k
- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k
- It also scales well with 8 A10G/A100 GPUs in our experiment.
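
To put the two consumer setups on a common footing, here is a rough price-performance sketch in Python. It assumes the quoted $3k and $2k are the total cost of each two-GPU pair (the numbers above do not say whether host hardware is included), and the helper name `tok_per_sec_per_kusd` is purely illustrative.

```python
# Throughput and price figures copied from the numbers quoted above.
# Assumption: each price is the total for the pair of GPUs.
setups = {
    "2x NVIDIA RTX 4090": (34.5, 3000),      # (tok/sec, USD)
    "2x AMD Radeon 7900XTX": (29.9, 2000),
}


def tok_per_sec_per_kusd(tok_per_sec, price_usd):
    """Return decoding throughput per $1000 of GPU cost."""
    return tok_per_sec / (price_usd / 1000)


ratios = {name: tok_per_sec_per_kusd(*spec) for name, spec in setups.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} tok/sec per $1k")
```

Under that assumption the 7900XTX pair comes out ahead per dollar (about 14.95 vs 11.5 tok/sec per $1k), even though the 4090 pair is faster in absolute terms.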
Details:
- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
- Project: https://github.com/mlc-ai/mlc-llm