| HN Mirror

Depends on the model. Each runner needs to implement support when there are new architectures, and they all seemingly focuses on different things. As far as I've gathered so far, vLLM focuses on inference speed, SGLang on parallizing across multiple GPUs, Ollama on being as fast out the door with their implementation as possible, sometimes cutting corners, llama.cpp sits somewhere in-between Ollama and vLLM. Then LM Studio seems to lag slightly behind with their llama.cpp usage, so I'm guessing that's the difference between LM Studio and building llama.cpp from source today.