by steren 317 days ago
> I would never want to use something like ollama in a production setting.

We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama came out on top. We hope to publish these results soon.

3 comments

You need to benchmark against llama.cpp as well.
Did you test multi-user cases?
Assuming this is equivalent to parallel sessions, I would hope so. This is like the entire point of vLLM.
vLLM and Ollama assume different settings and hardware. vLLM, built on PagedAttention, expects many requests from multiple users, whereas Ollama is usually run by a single user on a local machine.
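
For anyone who wants to check the single-user vs. multi-user point themselves, here is a rough sketch of a harness that measures aggregate tokens per second at different numbers of parallel sessions. It is not the benchmark mentioned above. `fake_generate`, `run_benchmark`, and `compare` are hypothetical names, and `fake_generate` only simulates a server with sleeps; you would swap in a real client call to whichever backend you are testing.

```python
import asyncio
import time


async def fake_generate(prompt: str, n_tokens: int = 20, per_token_s: float = 0.001) -> int:
    """Hypothetical stand-in for one request to an inference server.

    Sleeps once per simulated token and returns the number of tokens produced.
    Replace this with a real client call when benchmarking an actual backend.
    """
    for _ in range(n_tokens):
        await asyncio.sleep(per_token_s)
    return n_tokens


async def run_benchmark(generate, n_sessions: int, prompt: str = "hello") -> dict:
    """Fire n_sessions concurrent requests and report aggregate throughput."""
    start = time.perf_counter()
    counts = await asyncio.gather(*(generate(prompt) for _ in range(n_sessions)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return {
        "sessions": n_sessions,
        "total_tokens": total,
        "elapsed_s": elapsed,
        "tokens_per_s": total / elapsed,
    }


def compare(generate, session_counts=(1, 8)) -> list:
    """Run the benchmark once per session count (assumed: 1 = local use, 8 = multi-user)."""
    return [asyncio.run(run_benchmark(generate, n)) for n in session_counts]


if __name__ == "__main__":
    for row in compare(fake_generate):
        print(row)
```

The number to compare between backends is how `tokens_per_s` changes as `sessions` grows, not just the single-session figure.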