by junrushao1994
980 days ago
Yeah, thanks for sharing! This is definitely super valuable data and insight :)

Regarding exllama-V2, MLC/TVM does benchmark against it:

- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...
- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...

> vLLM focuses more on batching performance

Exactly. vLLM doesn't optimize for latency-first scenarios, as it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.

Regarding batching, it is coming pretty soon, and we will have another blog post on this.