
by woodson 1188 days ago

Not GP, but what NVIDIA Triton can do includes:

- Dynamic batching while limiting latency to a set threshold.

- Running multiple instances of a model, effectively load-balancing inference requests.

- Loading/unloading/running multiple versions of models dynamically, which is useful if you want to update (or roll back) your model while not interfering with existing inference requests.
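
To make the first bullet concrete, here is a toy sketch of the idea behind dynamic batching with a latency cap. It is not Triton's implementation or API: `DynamicBatcher`, `model_fn`, `max_batch_size` and `max_delay_s` are names made up for illustration, and the "model" is just a function that doubles its inputs. A batch is flushed when it is full or when the first request in it has waited `max_delay_s`, whichever comes first.

```python
import asyncio


class DynamicBatcher:
    """Toy batcher (hypothetical, not Triton code)."""

    def __init__(self, model_fn, max_batch_size=4, max_delay_s=0.01):
        self.model_fn = model_fn              # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size  # flush as soon as this many requests are waiting
        self.max_delay_s = max_delay_s        # ...or once the first request has waited this long
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded only so the demo can inspect batching

    async def infer(self, x):
        # Each caller gets a future that is resolved when its batch has run.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until there is at least one request, then start the clock.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # latency budget used up: run with what we have
            outputs = self.model_fn([x for x, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), y in zip(batch, outputs):
                fut.set_result(y)


async def main():
    # Stand-in "model": doubles every input in the batch.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Callers just `await batcher.infer(x)` and never see the batching. Triton does the equivalent on the server side, driven by the model's configuration, so you don't have to write this yourself.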

Its client provides async inference APIs, so you can easily put a FastAPI-based API server in front of it and don't necessarily need a task queue (like Celery).