by tubby12345
1696 days ago
the model that i got 20x on is very simple: just a couple of convs and relus, for edge detection on a pseudo-embedded platform (jetson). the wins from cuda graphs come from two things: complete elimination of individual kernel launch overhead and complete elimination of allocations for intermediate tensors, both of which dominate runtime when the kernels themselves are small (e.g. batch size = 1).
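To put rough numbers on that argument, here is a toy cost model. It is only a sketch: the microsecond figures are made-up placeholders rather than measurements from the Jetson, the function names are hypothetical, and it assumes launches and allocations never overlap with GPU work, so it overstates the eager-mode cost. It only shows the shape of the effect: fixed per-kernel overhead matters a lot when each kernel does little work and barely at all when each kernel does a lot.

```python
def eager_time_us(n_kernels, compute_us, launch_us, alloc_us):
    """Eager mode: every kernel pays its own launch plus one allocation
    for its intermediate output tensor (assumed, simplified)."""
    return n_kernels * (compute_us + launch_us + alloc_us)


def graph_time_us(n_kernels, compute_us, graph_launch_us):
    """Graph replay: one launch for the whole captured sequence,
    intermediate buffers are reused so no per-kernel allocation."""
    return graph_launch_us + n_kernels * compute_us


def speedup(n_kernels, compute_us,
            launch_us=10.0, alloc_us=10.0, graph_launch_us=10.0):
    # All overhead defaults are illustrative assumptions, not benchmarks.
    eager = eager_time_us(n_kernels, compute_us, launch_us, alloc_us)
    graph = graph_time_us(n_kernels, compute_us, graph_launch_us)
    return eager / graph


# Ten tiny kernels (1 us of work each) vs. ten heavy ones (500 us each).
small = speedup(n_kernels=10, compute_us=1.0)
large = speedup(n_kernels=10, compute_us=500.0)
print(f"small kernels: {small:.1f}x, large kernels: {large:.2f}x")
# prints: small kernels: 10.5x, large kernels: 1.04x
```

With these placeholder numbers the overhead is 20x the useful work in the small case and 4% of it in the large case, which is why the gain shrinks as the per-kernel work grows.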
|
We managed to get up to 10x at very low resolutions (160) for a ResNet-101, but the gain usually plateaus at a 1.7x to 1.9x speed-up for high resolutions (above 896x896). Int8 gives even higher speed-ups (about 3.6x for a 896x896 input), but for some tasks it degrades prediction quality too much.
I will definitely try your setup :)