Hacker News
by tubby12345 1696 days ago
There's another level of speed you can unlock by combining with CUDA graphs: https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs . i got (i kid you not) a 20x speedup on batch size = 1 inference by first using tensorrt to fuse kernels and then "graphing". and even for larger batch sizes it's just free perf gains

https://imgur.com/OKRbUNw
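
For readers who have not used the feature, here is a minimal sketch of the "graphing" step using the `torch.cuda.CUDAGraph` API described in the linked notes. It is not the commenter's actual code: the TensorRT fusion step is left out entirely, and the helper names `make_graphed_infer` and `tiny_conv_net`, the layer sizes, and the warm-up count of 3 are all assumptions made for illustration. It needs a CUDA-capable GPU to do anything useful.

```python
import torch
import torch.nn as nn


def make_graphed_infer(model: nn.Module, example_input: torch.Tensor):
    """Capture one forward pass into a CUDA graph and return a replay function.

    Hypothetical helper, not part of PyTorch. Input shape must stay fixed.
    """
    model = model.cuda().eval()
    # Static buffer: the captured graph always reads from this exact memory.
    static_input = example_input.detach().clone().cuda()

    # Warm up on a side stream before capture, as the PyTorch notes recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: kernels are recorded into the graph instead of being executed.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    def infer(x: torch.Tensor) -> torch.Tensor:
        static_input.copy_(x)  # refill the static input buffer in place
        graph.replay()         # one launch replays every captured kernel
        return static_output.clone()

    return infer


def tiny_conv_net() -> nn.Module:
    # Assumed stand-in for "a couple of convs and relus".
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=3, padding=1), nn.ReLU(),
    )


if __name__ == "__main__" and torch.cuda.is_available():
    infer = make_graphed_infer(tiny_conv_net(), torch.randn(1, 3, 64, 64))
    print(infer(torch.randn(1, 3, 64, 64, device="cuda")).shape)
```

The replay function copies each new input into the captured buffer because a graph replays fixed memory addresses; that is also why intermediate tensors need no fresh allocations at replay time.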


Holy crap that’s amazing! How complex is your model? And are there lots of parallelizable parts like filters or is it recurrent?
the model that i got 20x on is very simple - just a couple of convs and relus - it's for edge detection on a pseudo-embedded platform (jetson) - but the wins from cuda graphs come from two things: complete elimination of individual kernel launch times and complete elimination of allocations for intermediate tensors, both of which dominate runtime for small workloads (e.g. batch size = 1).
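
To make that argument concrete, here is a back-of-the-envelope cost model. It is a crude sketch that assumes kernels run strictly one after another and ignores the overlap between CPU launches and GPU work; every number in it (4 kernels, 10 us per launch, 5 us per allocation, 10 us to launch a whole graph) is a made-up placeholder, not a measurement from this thread.

```python
def eager_time_us(n_kernels: int, compute_us: float,
                  launch_us: float, alloc_us: float) -> float:
    """Eager mode: every kernel pays its own launch and allocation overhead."""
    return n_kernels * (launch_us + alloc_us + compute_us)


def graphed_time_us(n_kernels: int, compute_us: float,
                    graph_launch_us: float) -> float:
    """Graph replay: one launch for the whole graph, no per-kernel allocations."""
    return graph_launch_us + n_kernels * compute_us


def speedup(n_kernels: int, compute_us: float, launch_us: float = 10.0,
            alloc_us: float = 5.0, graph_launch_us: float = 10.0) -> float:
    """Ratio of eager time to graphed time under the placeholder overheads."""
    eager = eager_time_us(n_kernels, compute_us, launch_us, alloc_us)
    graphed = graphed_time_us(n_kernels, compute_us, graph_launch_us)
    return eager / graphed


if __name__ == "__main__":
    # Sweep per-kernel compute time: tiny inputs on the left, big ones on the right.
    for compute_us in (1.0, 10.0, 100.0, 1000.0):
        print(f"compute={compute_us:7.1f} us  speedup={speedup(4, compute_us):.2f}x")
```

Under these assumptions the ratio is large when each kernel does almost no work and drifts toward 1x as per-kernel compute grows, which matches the intuition that fixed overheads only dominate for small inputs.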
That is so cool! May I ask at which resolution you got those results?

We managed to get up to 10x at very low resolutions (160) with a ResNet-101, but the speed-up usually plateaus at about 1.7x to 1.9x for high resolutions (above 896x896). Using INT8 gives even higher speed-ups (about 3.6x for 896x896 input), but for some tasks it degrades accuracy too much.

I will definitely try your setup :)

indeed small resolutions (64x64), but i mean a 2x speedup is still nothing to sneeze at.
I agree, especially when it is free accuracy-wise :)