by mtthtlt 1696 days ago
I think using TRTorch[1] can be a quick way to get PyTorch inference models that are both easy to use and fast.

It compiles your model ahead of time using TensorRT, and you can then use the compiled model in your application through torch.jit.load("your_trtorch_model.ts"). Once compiled, you no longer need to keep your model's code in the application (as with regular jit models).
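
To make the "no model code needed" point concrete, here is a minimal sketch of the save/load round trip. TinyNet and the file name are hypothetical, and the sketch runs on CPU with plain TorchScript. The commented-out TRTorch call is an assumption based on the compile-spec dictionary used by early TRTorch releases; later releases (renamed Torch-TensorRT) use a different spec, so check the version you have installed.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a real model."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))


model = TinyNet().eval()
scripted = torch.jit.script(model)

# With TRTorch and a GPU, the ahead-of-time compile step would go here.
# Assumption: early TRTorch compile-spec format, not verified for your version.
# import trtorch
# scripted = trtorch.compile(
#     scripted.cuda(),
#     {"input_shapes": [[1, 3, 64, 64]], "op_precision": torch.half},
# )

x = torch.randn(1, 3, 64, 64)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "your_trtorch_model.ts")
    torch.jit.save(scripted, path)

    # The consuming application only needs this line, not the TinyNet class.
    loaded = torch.jit.load(path)

    with torch.no_grad():
        expected = model(x)
        actual = loaded(x)

print("outputs match:", torch.allclose(expected, actual))
```

The serialized file carries the graph and the weights, which is why the loading side does not have to import the Python class that defined the model.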

Inference time is on par with TensorRT, and it does the optimizations for you. You can also run your model in FP16, or quantize it to Int8 using PTQ (post-training quantization), which should give you an additional inference speed-up.

Here is a tutorial[2] to leverage TRTorch.

[1] https://github.com/NVIDIA/TRTorch/tree/master/core
[2] https://www.photoroom.com/tech/faster-image-segmentation-trt...


There's another level of speed you can unlock by combining this with CUDA graphs (https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs). I got (I kid you not) a 20x speed-up on batch size = 1 inference by first using TensorRT to fuse kernels and then "graphing". Even for larger batch sizes it's just free perf gains.

https://imgur.com/OKRbUNw

Holy crap that’s amazing! How complex is your model? And are there lots of parallelizable parts like filters or is it recurrent?
The model I got 20x on is very simple, just a couple of convs and ReLUs. It's for edge detection on a pseudo-embedded platform (Jetson). The wins from CUDA graphs come from two things: complete elimination of individual kernel launch overhead and complete elimination of allocations for intermediate tensors, both of which dominate runtime when the kernels themselves are small (e.g. batch size = 1).
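
For anyone who wants to try the "graphing" step, here is a rough sketch of capture and replay with torch.cuda.CUDAGraph, following the pattern in the PyTorch CUDA graphs notes. The layer sizes and the 64x64 input are made up for illustration, and the model is a plain eager one (the TensorRT fusion step is left out). It needs a CUDA GPU and a PyTorch version that ships torch.cuda.CUDAGraph.

```python
import torch
import torch.nn as nn


def graphed_matches_eager():
    """Capture a tiny conv net in a CUDA graph and compare a replay to eager."""
    device = torch.device("cuda")
    # Hypothetical "couple of convs and ReLUs" model.
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=3, padding=1),
        nn.ReLU(),
    ).to(device).eval()

    # Graph replays always read from and write to the same memory,
    # so inputs and outputs live in fixed "static" tensors.
    static_input = torch.zeros(1, 1, 64, 64, device=device)

    with torch.no_grad():
        # Warm up on a side stream before capture, as the PyTorch notes suggest.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(3):
                model(static_input)
        torch.cuda.current_stream().wait_stream(side)

        # Capture: the kernels are recorded into the graph, not executed.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_output = model(static_input)

        # Inference: copy new data into the static input, then replay.
        # One replay launches the whole recorded kernel sequence and reuses
        # the intermediate buffers allocated during capture.
        frame = torch.randn(1, 1, 64, 64, device=device)
        static_input.copy_(frame)
        graph.replay()

        return torch.allclose(static_output, model(frame), atol=1e-3)


if torch.cuda.is_available():
    print("graph replay matches eager:", graphed_matches_eager())
else:
    print("no CUDA device found, skipping")
```

The trade-off is that shapes are frozen at capture time, so each input resolution or batch size needs its own captured graph.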
That is so cool! May I ask at which resolution you had those results?

We managed to get up to 10x at very low resolutions (160) for a ResNet-101, but it usually plateaus at a 1.7x to 1.9x speed-up for high resolutions (above 896x896). Using Int8 gives even higher speed-ups (~3.6x for an 896x896 input), but for some tasks it degrades accuracy too much.

I will definitely try your setup :)
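
Per-resolution speed-ups like the ones quoted above can be measured with a small timing harness. This is only a sketch: the two-conv model is hypothetical, and torch.jit.script is used as a stand-in for the optimized model, so the numbers it prints say nothing about TRTorch itself. The part that matters is calling torch.cuda.synchronize() around the timed region when running on a GPU, because CUDA kernels launch asynchronously.

```python
import statistics
import time

import torch
import torch.nn as nn


def median_latency_ms(fn, x, warmup=3, iters=10):
    """Median wall-clock latency of fn(x) in milliseconds."""
    on_cuda = x.is_cuda
    samples = []
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        for _ in range(iters):
            if on_cuda:
                torch.cuda.synchronize()  # drain pending GPU work first
            start = time.perf_counter()
            fn(x)
            if on_cuda:
                torch.cuda.synchronize()  # wait for the kernels to finish
            samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical baseline model.
baseline = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
).to(device).eval()

# Stand-in for the optimized model; swap in your compiled module here.
optimized = torch.jit.script(baseline)

speedups = {}
for res in (64, 160):
    x = torch.randn(1, 3, res, res, device=device)
    speedups[res] = median_latency_ms(baseline, x) / median_latency_ms(optimized, x)

for res, ratio in speedups.items():
    print(f"{res}x{res}: {ratio:.2f}x")
```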

Indeed, small resolutions (64x64). But I mean, a 2x speed-up is still nothing to sneeze at.
I agree, especially when it is free accuracy-wise :)
As someone who's been vaguely interested in PyTorch inference optimization but has never had a clear jumping-in point, thank you for this comment! It's nice to see a clear two-sentence explanation that actually makes sense to me, and it makes me really want to try out TRTorch and TensorRT!

Have a nice day internet stranger.