| HN Mirror

Yeah thanks for sharing! This is definitely super valuable data and insights :)

Regarding exllama-V2, MLC/TVM does benchmark against it:

- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...

- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...

> vLLM focuses more on batching performance

Exactly. vLLM doesn’t optimize for latency-first scenarios as it focuses on throughput, i.e. batching. This particular blog post instead focuses particular on latency, i.e. the fastest you could possible get with those many GPUsz

Regarding batching, it is coming pretty soon, and we will have another blog post on this.