by andrew-wja
2742 days ago
cuDNN's autotuner only makes local decisions. If layer X is connected to layer Y, which is connected to layer Z, simply choosing the fastest algorithm to implement each layer does not guarantee optimality. That's because different convolution algorithms work best with different data layouts. For example, Winograd convolution is usually fastest using the NHWC layout, while patch-matrix based approaches like im2col are usually fastest using the NCHW layout. When two connected layers disagree about the tensor layout, you have to do a data layout transformation, which is expensive. The key to the performance we're getting is that we benchmark those data layout transformations and make a global decision about the whole network: the optimizer knows that the cost of selecting algorithms for two connected layers is the total cost of algo A + layout transform + algo B, not just the cost of algo A + algo B. So in a nutshell, the optimizer uses global information, while the autotuner uses only local information. If you have a squint at our paper, we actually model what the autotuner is doing; that's the "local optimal" bar on the graphs. We always beat it!
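To make the local-versus-global point concrete, here is a toy sketch in Python. It is not the optimizer from the paper: it assumes a simple chain of three layers, two candidate algorithms per layer, and a single flat layout-conversion cost. All runtimes and the names `LAYERS`, `TRANSFORM_COST`, `local_choice`, and `global_choice` are made up for illustration. The "local" picker takes the fastest algorithm per layer; the "global" picker is a small dynamic program that also charges for layout transforms between neighbours.

```python
# Toy model: every number below is hypothetical, not a measurement.
# Each layer maps algorithm name -> (layout it runs in, runtime in ms).
LAYERS = [
    {"winograd": ("NHWC", 1.0), "im2col": ("NCHW", 1.2)},
    {"winograd": ("NHWC", 2.0), "im2col": ("NCHW", 1.5)},
    {"winograd": ("NHWC", 1.0), "im2col": ("NCHW", 1.4)},
]
TRANSFORM_COST = 0.8  # assumed cost of one NHWC <-> NCHW conversion


def transform(src_layout, dst_layout):
    """Cost of handing a tensor from one layer to the next."""
    return 0.0 if src_layout == dst_layout else TRANSFORM_COST


def total_cost(plan):
    """Runtime of a full plan, including layout transforms between layers."""
    cost, prev_layout = 0.0, None
    for layer, algo in zip(LAYERS, plan):
        layout, runtime = layer[algo]
        if prev_layout is not None:
            cost += transform(prev_layout, layout)
        cost += runtime
        prev_layout = layout
    return cost


def local_choice():
    """Pick the fastest algorithm per layer, ignoring the neighbours."""
    return [min(layer, key=lambda a: layer[a][1]) for layer in LAYERS]


def global_choice():
    """Dynamic program over the chain: algo A + layout transform + algo B."""
    # best[algo] = (cheapest cost of a plan ending in algo, its layout, the plan)
    best = {a: (rt, lay, [a]) for a, (lay, rt) in LAYERS[0].items()}
    for layer in LAYERS[1:]:
        new_best = {}
        for algo, (layout, runtime) in layer.items():
            new_best[algo] = min(
                (
                    (cost + transform(prev_layout, layout) + runtime,
                     layout, plan + [algo])
                    for cost, prev_layout, plan in best.values()
                ),
                key=lambda entry: entry[0],
            )
        best = new_best
    return min(best.values(), key=lambda entry: entry[0])[2]


if __name__ == "__main__":
    for name, plan in (("local", local_choice()), ("global", global_choice())):
        print(name, plan, round(total_cost(plan), 2))
```

With these made-up numbers the per-layer picker alternates between algorithms and pays for two layout conversions, while the chain-aware picker accepts a slower middle layer to avoid them. Real networks are graphs rather than chains, so an actual solver needs something more general than this dynamic program.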