by andrew-wja 2742 days ago
cuDNN's autotuner only makes local decisions. If layer X is connected to layer Y, which is connected to layer Z, simply choosing the fastest algorithm to implement each layer does not guarantee optimality, because different convolution algorithms work best with different data layouts. For example, Winograd convolution is usually fastest using the NHWC layout, while patch-matrix-based approaches like im2col are usually fastest using the NCHW layout.

When two connected layers disagree about the tensor layout, you have to do a data layout transformation, which is expensive. The key to the performance we're getting is that we benchmark those data layout transformations and make a global decision across the whole network. The optimizer is aware that the cost of selecting algorithms for two connected layers is the total cost of algo A + layout transform + algo B, not just the cost of algo A + algo B.
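To make the cost model concrete, here is a minimal sketch for a three-layer chain. All the numbers are made up for illustration, the names (`LAYER_COSTS`, `TRANSFORM_COST`, `local_choice`, `global_choice`) are hypothetical, and the dynamic program over a chain is a stand-in chosen for brevity, not necessarily the solver described in the paper.

```python
# Hypothetical benchmark results in microseconds, one dict per layer:
# (algorithm, layout it prefers) -> measured cost of that layer.
LAYER_COSTS = [
    {("winograd", "NHWC"): 100, ("im2col", "NCHW"): 120},
    {("winograd", "NHWC"): 200, ("im2col", "NCHW"): 150},
    {("winograd", "NHWC"): 100, ("im2col", "NCHW"): 140},
]
TRANSFORM_COST = 80  # hypothetical cost of one NHWC <-> NCHW conversion


def transform(src_layout, dst_layout):
    """Cost of moving a tensor between two layouts (free if they match)."""
    return 0 if src_layout == dst_layout else TRANSFORM_COST


def total_cost(choices):
    """algo A + layout transform + algo B, summed along the chain."""
    cost = sum(LAYER_COSTS[i][c] for i, c in enumerate(choices))
    cost += sum(transform(a[1], b[1]) for a, b in zip(choices, choices[1:]))
    return cost


def local_choice():
    """Autotuner-style: pick the fastest algorithm per layer in isolation."""
    return [min(costs, key=costs.get) for costs in LAYER_COSTS]


def global_choice():
    """Dynamic program over the chain that accounts for layout transforms."""
    # best[c] = (cheapest cost of a prefix ending in choice c, that prefix)
    best = {c: (cost, [c]) for c, cost in LAYER_COSTS[0].items()}
    for costs in LAYER_COSTS[1:]:
        new_best = {}
        for c, cost in costs.items():
            prev, (prev_cost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + transform(kv[0][1], c[1]),
            )
            new_best[c] = (prev_cost + transform(prev[1], c[1]) + cost, path + [c])
        best = new_best
    return min(best.values(), key=lambda v: v[0])


if __name__ == "__main__":
    print("local :", total_cost(local_choice()), local_choice())
    print("global:", global_choice())
```

With these made-up numbers, the per-layer pick switches to im2col for the middle layer and pays for two layout transforms, while the global pick stays in one layout throughout and comes out cheaper overall.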

So in a nutshell, the optimizer is making use of global information, while the autotuner is only using local information. If you have a squint at our paper, we actually model what the autotuner is doing; that's the "local optimal" bar on the graphs. We always beat it!

1 comment

I see. So what do you intend it to become? Are you building a middleware layer to be inserted between, say, PyTorch and cuDNN? I'm training convnets and RNNs written in PyTorch and TF on GPUs; how can I benefit from your work?

You're essentially correct, but there is a bit of a problem with PyTorch and TF specifically, because you don't really have a definition of the model per se. You construct it dynamically using a Python or C++ program.

The Caffe .prototxt format and the ONNX model format are nice declarative specifications of what the model is supposed to do, so those are good input formats for the compiler. I hope more frameworks will prioritize ONNX, because it's really the wild west out here with every framework reinventing the wheel for model specification!
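A rough illustration of why a declarative specification is convenient as compiler input: the model is plain data, so a tool can enumerate layers and connections without running any user code. The structure below is a hypothetical toy format invented for this sketch; it is not the actual .prototxt or ONNX schema.

```python
# Hypothetical toy spec: each entry names a layer, its operator, and its inputs.
MODEL_SPEC = [
    {"name": "conv1", "op": "Conv", "inputs": ["data"]},
    {"name": "relu1", "op": "Relu", "inputs": ["conv1"]},
    {"name": "conv2", "op": "Conv", "inputs": ["relu1"]},
]


def edges(spec):
    """List (producer, consumer) pairs, i.e. where layouts must agree."""
    return [(src, layer["name"]) for layer in spec for src in layer["inputs"]]


def conv_layers(spec):
    """Layers for which a convolution algorithm would have to be selected."""
    return [layer["name"] for layer in spec if layer["op"] == "Conv"]


if __name__ == "__main__":
    print(edges(MODEL_SPEC))
    print(conv_layers(MODEL_SPEC))
```

When the model is instead built dynamically by a program, the same graph only exists once that program has run, which is the difficulty described above.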