

by kkielhofner 659 days ago

Hah, it actually gets worse. What I was describing was the Triton ONNX backend with the OpenVINO execution accelerator[0] (not the OpenVINO backend itself). Clear as mud, right?

Your issue here is model performance with the additional challenge of offering it over a network socket across multiple requests and doing so in a performant manner.

Triton does things like dynamic batching[1] where throughput is increased significantly by aggregating disparate requests into one pass through the GPU.
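To make the batching idea concrete, here is a toy sketch in plain Python asyncio. It is not Triton's implementation, just the general pattern: requests wait in a queue for a short window, then go through the model in one call. All names here (`DynamicBatcher`, `fake_model`, `max_batch_size`, `max_delay_s`) are hypothetical and made up for illustration.

```python
import asyncio


class DynamicBatcher:
    """Toy dynamic batcher: collect queued requests and run them as one batch."""

    def __init__(self, model_fn, max_batch_size=8, max_delay_s=0.005):
        self.model_fn = model_fn              # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size  # hypothetical knob: upper bound on batch size
        self.max_delay_s = max_delay_s        # hypothetical knob: how long to wait for more requests
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded only so the demo can inspect batching

    async def infer(self, item):
        # Each caller enqueues its input plus a future that will receive its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then open a short batching window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One pass through the "model" for the whole batch.
            outputs = self.model_fn([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


def fake_model(batch):
    # Stand-in for a real model: doubles every input in the batch.
    return [x * 2 for x in batch]


async def demo():
    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_delay_s=0.05)
    worker = asyncio.create_task(batcher.run())
    # Eight independent "clients" submit requests concurrently.
    results = await asyncio.gather(*(batcher.infer(i) for i in range(8)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is the usual one: a longer wait window gives bigger batches and better throughput, at the cost of some added latency per request.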

A Docker container for torch, ONNX, OpenVINO, etc. isn't even going to offer a network socket natively. This is where people try things like rolling their own FastAPI implementation (or something similar), only to discover it completely falls apart under any kind of load. That's development effort as well, but it's a waste of time.

[0] - https://github.com/triton-inference-server/onnxruntime_backe...

[1] - https://docs.nvidia.com/deeplearning/triton-inference-server...


backend-dev-33 659 days ago

> additional challenge of offering it over a network socket across multiple requests and doing so in a performant manner.

@kkielhofner thanks a lot! Now I get it. I see there is even gRPC support in Triton, so it makes sense.


kkielhofner 656 days ago

Make sure to check out the existing Triton client libraries:

https://github.com/triton-inference-server/client
