| HN Mirror



	by tubby12345 1743 days ago
	There's another level of speed you can unlock by combining this with CUDA graphs (https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs). I got (I kid you not) a 20x speed-up on batch size = 1 inference by first using TensorRT to fuse kernels and then "graphing". And even for larger batch sizes it's just free perf gains: https://imgur.com/OKRbUNw


l-lousy 1743 days ago

Holy crap that’s amazing! How complex is your model? And are there lots of parallelizable parts like filters or is it recurrent?


tubby12345 1743 days ago

The model that I got 20x on is very simple, just a couple of convs and ReLUs, and it's for edge detection on a pseudo-embedded platform (Jetson). But the wins from CUDA graphs come from two things: complete elimination of individual kernel launch times and complete elimination of allocations for intermediate tensors. Both dominate runtime when each kernel does very little work (e.g. batch size = 1).
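For anyone who wants to try the "graphing" step, here is a minimal sketch of whole-network capture with the `torch.cuda.CUDAGraph` API from the notes linked above. It is not the exact setup described in this comment: the TensorRT fusion step is left out, the helper name `make_graphed_inference` and its parameters are made up for illustration, and it assumes the model is already on the GPU in eval mode and that every input has the same shape and dtype as the example input.

```python
def make_graphed_inference(model, example_input, warmup_iters=3):
    """Capture model(example_input) in a CUDA graph and return a replay function.

    Hypothetical helper: assumes `model` lives on the GPU in eval mode and that
    all later inputs match `example_input` in shape and dtype.
    """
    import torch  # imported here so the sketch loads even without PyTorch installed

    # The graph always reads from this fixed buffer, so keep our own copy of it.
    static_input = example_input.clone()

    # Warm up on a side stream before capture, as the PyTorch notes recommend.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture one forward pass. Intermediate tensors are allocated once, from the
    # graph's private memory pool, and reused on every replay.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    def run(x):
        static_input.copy_(x)         # refill the captured input buffer in place
        graph.replay()                # relaunch every captured kernel with one call
        return static_output.clone()  # copy out, the next replay overwrites it

    return run
```

Usage would look something like `run = make_graphed_inference(net, torch.randn(1, 3, 64, 64, device="cuda"))` followed by `run(frame)` per input. The replay function only works for inputs with the captured shape, which is why fixed-size batch size = 1 inference is a good fit.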


mtthtlt 1743 days ago

That is so cool! May I ask at which resolution you got those results?

We managed to get up to 10x at very low resolutions (160) for a ResNet-101, but it usually plateaus at a 1.7x to 1.9x speed-up for high resolutions (above 896x896). Using INT8 gives even higher speed-ups (about 3.6x for an 896x896 input), although for some tasks it degrades model accuracy too much.

I will definitely try your setup :)
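One way to see why the speed-up is huge at low resolution and plateaus at high resolution is a back-of-the-envelope cost model: a fixed per-inference overhead (launches, allocations) plus compute time that grows with input size. The sketch below is a toy model, not a measurement. The function name and every number in it are invented for illustration, and `compute_gain` just stands in for whatever the optimized kernels gain on pure compute.

```python
def estimated_speedup(compute_us, fixed_overhead_us, optimized_overhead_us, compute_gain):
    """Toy model: speed-up of an optimized pipeline over a baseline.

    compute_us             time spent in actual GPU math for the baseline
    fixed_overhead_us      baseline launch + allocation overhead per inference
    optimized_overhead_us  overhead left after optimization
    compute_gain           assumed factor by which the optimized kernels speed up the math
    """
    baseline = fixed_overhead_us + compute_us
    optimized = optimized_overhead_us + compute_us / compute_gain
    return baseline / optimized


# Illustrative, made-up numbers (microseconds): only compute_us changes with resolution.
small_input = estimated_speedup(compute_us=20, fixed_overhead_us=200,
                                optimized_overhead_us=10, compute_gain=1.8)
large_input = estimated_speedup(compute_us=20_000, fixed_overhead_us=200,
                                optimized_overhead_us=10, compute_gain=1.8)

print(f"small input: {small_input:.1f}x, large input: {large_input:.2f}x")
```

With these assumed numbers the small input is dominated by overhead, so removing it gives a speed-up of roughly an order of magnitude. For the large input the overhead is negligible and the speed-up approaches `compute_gain`, which is the plateau behaviour described above.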


tubby12345 1743 days ago

Indeed, small resolutions (64x64), but I mean a 2x speed-up is still nothing to sneeze at.


mtthtlt 1743 days ago

I agree, especially when it is free accuracy-wise :)
