Nvidia Triton Inference Server with the TensorRT-LLM backend: https://github.com/triton-inference-server/tensorrtllm_backe... It’s used by Mistral, AWS, Cloudflare, and countless others. vLLM, HF TGI, Ray Serve, etc. are certainly viable, but Triton has many truly unique and very powerful features (not to mention performance). 100k DAU doesn’t mean much on its own. You’d need a better understanding of the application: input tokens, generated output tokens, request rates, peaks, etc., not to mention required time to first token, tokens per second, and so on. Anyway, the point is that Triton is just about the only thing out there for use in this general range and up.
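To make the sizing point concrete, here is a rough back-of-the-envelope sketch of how the same DAU figure can translate into very different serving loads. Every number and name in it (`Workload`, `required_throughput`, the two example workloads) is a made-up assumption for illustration, not a measurement from any real deployment.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical traffic profile; all fields are illustrative assumptions."""
    daily_active_users: int
    requests_per_user_per_day: float
    avg_input_tokens: int
    avg_output_tokens: int
    peak_to_average: float  # peak request rate divided by mean request rate


def required_throughput(w: Workload) -> dict:
    """Convert a traffic profile into request and token rates."""
    seconds_per_day = 24 * 60 * 60
    mean_rps = w.daily_active_users * w.requests_per_user_per_day / seconds_per_day
    peak_rps = mean_rps * w.peak_to_average
    return {
        "mean_rps": mean_rps,
        "peak_rps": peak_rps,
        # Prompt tokens that must be processed per second at peak.
        "peak_prefill_tokens_per_s": peak_rps * w.avg_input_tokens,
        # Tokens that must be generated per second at peak.
        "peak_decode_tokens_per_s": peak_rps * w.avg_output_tokens,
    }


if __name__ == "__main__":
    # Two invented applications with the same 100k DAU.
    chat = Workload(100_000, 10, 500, 200, 5.0)
    summarizer = Workload(100_000, 1, 4000, 50, 1.5)
    for name, w in [("chat", chat), ("summarizer", summarizer)]:
        rates = required_throughput(w)
        print(name, {k: round(v, 1) for k, v in rates.items()})
```

This only covers throughput. Time to first token and per-request tokens per second are latency targets and would still have to be measured against the actual model and hardware.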
What I like about vLLM is the following:
- It exposes AsyncLLMEngine, which can be easily wrapped in any API you'd like.
- It has a logit processor API making it simple to integrate custom sampling logic.
- It has decent support for inference of quantized models.
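As a rough illustration of the logit processor point, here is a minimal sketch of a processor that bans a set of token IDs. It assumes the per-request callable form `(generated_token_ids, logits) -> logits` that vLLM has accepted through `SamplingParams(logits_processors=[...])`; the interface differs between vLLM versions, so check the one you have installed. `make_ban_tokens_processor` is a hypothetical helper name, and the demo uses a plain list in place of the tensor vLLM would pass.

```python
from typing import Callable, Iterable


def make_ban_tokens_processor(banned_token_ids: Iterable[int]) -> Callable:
    """Build a processor that makes the given token IDs impossible to sample."""
    banned = sorted(set(banned_token_ids))

    def processor(generated_token_ids, logits):
        # `logits` is indexable by token ID: a 1-D tensor inside vLLM,
        # a plain Python list in the demo below.
        for token_id in banned:
            logits[token_id] = float("-inf")
        return logits

    return processor


if __name__ == "__main__":
    # Stand-in for a 5-token vocabulary; no vLLM needed to run this part.
    ban = make_ban_tokens_processor([1, 3])
    print(ban([], [0.5, 2.0, -1.0, 4.0, 0.0]))

    # Assumed wiring into vLLM (version-dependent, not executed here):
    #   from vllm import SamplingParams
    #   params = SamplingParams(max_tokens=64, logits_processors=[ban])
```

Because the processor is just a callable, the same pattern extends to things like constrained decoding or repetition penalties by changing what is written into `logits`.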