Full disclosure: I'm a compiler person who, for funding reasons, moved into working on the performance of machine learning systems.
None of those things should be called compilers. At best, they are scaffolding for peephole optimization. When you can get these crazy speedups just from bolting on an instruction selector, that's a real indicator that a lot of stuff is just waiting to be done.
For context, MKL-DNN embeds XBYAK, an optimizing JIT targeting SSE4.2, AVX2, and AVX512. It sees all the dimensions of the tensors, it knows the strides of the kernel, and so on and so forth. So for us to be able to beat it by such a margin just by stepping back and doing something simple at a higher level of abstraction kinda indicates that the approach it's using is running up against some conceptual limits. It's not that MKL-DNN's JIT isn't good -- it's great, and it's a credit to the engineers working on it. But the problem is that the smarts are being applied in the wrong place!
cuDNN's autotuner is only making local decisions. If layer X is connected to layer Y, which is connected to layer Z, simply choosing the fastest algorithm to implement each layer does not guarantee optimality. That's because the different convolution algorithms work best using different data layouts. For example, Winograd convolution is usually fastest using NHWC layout, while patch-matrix based approaches like im2col are usually fastest using NCHW layouts.
When two connected layers disagree about the tensor layout, you have to do a data layout transformation, which is expensive. The key to the performance we're getting is that we benchmark those data layout transformations, and we make a global decision about the network, where the optimizer is aware that the cost for selecting algorithms for connected layers is the total cost of algo A + layout transform + algo B, not just the cost of algo A + algo B.
So in a nutshell, the optimizer is making use of global information, while the autotuner is only using local information. If you have a squint at our paper, we actually model what the autotuner is doing; that's the "local optimal" bar on the graphs. We always beat it!
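To make the local-versus-global distinction concrete, here is a toy sketch in Python. It is not the optimizer from the paper: it assumes a simple chain of three layers, two candidate algorithms per layer, and a flat layout-transform cost, and it uses plain dynamic programming over the chain. All names (`LAYERS`, `TRANSFORM_COST`, `local_choice`, `global_choice`) and all timings are made up for illustration; real numbers would come from benchmarking.

```python
# Hypothetical per-layer candidates: name -> (data layout, runtime in ms).
# All numbers are invented for illustration, not measured.
LAYERS = [
    {"winograd": ("NHWC", 1.0), "im2col": ("NCHW", 1.2)},
    {"winograd": ("NHWC", 1.9), "im2col": ("NCHW", 1.5)},
    {"winograd": ("NHWC", 1.0), "im2col": ("NCHW", 1.3)},
]
TRANSFORM_COST = 0.6  # assumed flat cost of an NHWC <-> NCHW conversion (ms)


def transform_cost(src, dst):
    """Cost of converting a tensor between layouts (zero if they match)."""
    return 0.0 if src == dst else TRANSFORM_COST


def total_cost(choice):
    """Algorithm runtimes plus every layout transform between neighbours."""
    cost, prev_layout = 0.0, None
    for layer, name in zip(LAYERS, choice):
        layout, runtime = layer[name]
        if prev_layout is not None:
            cost += transform_cost(prev_layout, layout)
        cost += runtime
        prev_layout = layout
    return cost


def local_choice():
    """Autotuner-style: pick the fastest algorithm for each layer in isolation."""
    return [min(layer, key=lambda n: layer[n][1]) for layer in LAYERS]


def global_choice():
    """Dynamic programming over the chain, charging for layout transforms."""
    prev = LAYERS[0]
    # best[name] = (cheapest cost of any plan ending in `name`, that plan)
    best = {n: (runtime, [n]) for n, (_, runtime) in prev.items()}
    for layer in LAYERS[1:]:
        new_best = {}
        for n, (layout, runtime) in layer.items():
            new_best[n] = min(
                (c + transform_cost(prev[p][0], layout) + runtime, path + [n])
                for p, (c, path) in best.items()
            )
        best, prev = new_best, layer
    return min(best.values())


if __name__ == "__main__":
    local = local_choice()
    print("local :", local, total_cost(local))
    print("global:", *reversed(global_choice()))
```

With these made-up numbers, the per-layer pick switches layout twice and pays for two transforms, while the global pick stays in one layout and accepts a slightly slower middle layer. A real network is a graph rather than a chain, and transform costs depend on tensor shape, so treat this only as the shape of the argument.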
I see. So what do you intend it to become? Are you building a middleware to be inserted between, say, PyTorch and cuDNN? I'm training convnets and RNNs written in PyTorch and TF on GPUs; how can I benefit from your work?
You're essentially correct, but there is a bit of a problem with PyTorch and TF specifically, because you don't really have a definition of the model per se. You construct it dynamically using a Python or C++ program.
The Caffe .prototxt format or the ONNX model format are nice declarative specifications for what the model is supposed to do; so those are good input formats for the compiler. I hope more frameworks will prioritize ONNX, because it's really the wild west out here with every framework reinventing the wheel for model specification!