| HN Mirror



	by tubby12345 1743 days ago
	the model that i got 20x on is very simple - just a couple of convs and relus - it's for edge detection on a pseudo-embedded platform (jetson). the wins from cuda graphs come from two things: complete elimination of individual kernel launch times and complete elimination of allocations for intermediate tensors, both of which dominate runtime when the kernels themselves are small (e.g. batch size = 1).
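For readers who want to try the idea, here is a minimal sketch of capturing and replaying a CUDA graph. It assumes PyTorch 1.10 or later (`torch.cuda.CUDAGraph`, `torch.cuda.graph`) and a CUDA-capable GPU; the comment above does not say which framework was used, so the framework, the layer sizes, and the `demo` function are all illustrative assumptions rather than the commenter's setup.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def demo():
    """Capture a tiny conv+ReLU net in a CUDA graph and replay it once."""
    if torch is None or not torch.cuda.is_available():
        return None  # needs PyTorch and a CUDA GPU

    # Hypothetical stand-in for "a couple of convs and relus".
    model = torch.nn.Sequential(
        torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
        torch.nn.ReLU(),
        torch.nn.Conv2d(8, 1, kernel_size=3, padding=1),
        torch.nn.ReLU(),
    ).cuda().eval()

    # A graph replays fixed memory addresses, so the input lives in a
    # static buffer that is overwritten in place before each replay.
    static_input = torch.zeros(1, 1, 64, 64, device="cuda")

    # Warm up on a side stream so one-time setup work is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture: the kernels are recorded into the graph, and the
    # intermediate tensors get memory from the graph's private pool.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    # Replay: a single launch re-runs every captured kernel, with no
    # per-kernel launch calls and no new allocations.
    new_frame = torch.rand(1, 1, 64, 64, device="cuda")
    static_input.copy_(new_frame)
    graph.replay()
    return static_output.clone()
```

Calling `demo()` returns a `(1, 1, 64, 64)` tensor on a GPU machine and `None` otherwise. The static buffers are the main design constraint: every replay reads from and writes to the same tensors, so new data has to be copied in rather than passed as a fresh argument.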


mtthtlt 1743 days ago

That is so cool! May I ask at which resolution you got those results?

We managed to get up to 10x for very low resolutions (160) for a resnet101, but it usually plateaus at a 1.7x to 1.9x speed-up for high resolutions (above 896x896). Using Int8 gives even higher speed-ups (~3.6x for 896x896 input), but for some tasks it degrades the performance too much.

I will definitely try your setup :)
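One way to see why the gain shrinks as resolution grows is a crude additive cost model: a fixed launch cost per kernel plus compute time that scales with pixel count. The sketch below is only that toy model. The function name and every default value (kernel count, microsecond costs) are made-up assumptions for illustration, not measurements from either commenter, and the model ignores the overlap between asynchronous launches and GPU execution.

```python
def estimated_speedup(
    resolution,
    n_kernels=100,              # hypothetical number of kernels per forward pass
    launch_us=10.0,             # hypothetical cost of one individual kernel launch
    graph_launch_us=10.0,       # hypothetical cost of launching the whole graph once
    compute_us_per_pixel=0.01,  # hypothetical GPU compute time per input pixel
):
    """Toy estimate of eager time divided by graph-replay time."""
    compute_us = compute_us_per_pixel * resolution * resolution
    eager_us = n_kernels * launch_us + compute_us  # one launch per kernel
    graphed_us = graph_launch_us + compute_us      # one launch in total
    return eager_us / graphed_us


if __name__ == "__main__":
    for res in (64, 160, 896):
        print(res, round(estimated_speedup(res), 2))
```

Under these assumptions the launch overhead is a fixed cost while compute grows with the square of the resolution, so the ratio falls toward 1 as images get larger. That matches the qualitative shape described in the thread: large wins at small inputs, a plateau at large ones.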


tubby12345 1743 days ago

indeed small resolutions (64x64), but i mean a 2x speed-up is still nothing to sneeze at.


mtthtlt 1743 days ago

I agree, especially when it is free accuracy-wise :)
