1. This seems to be oriented to convolutions. While convolution is rather important for image-oriented DNN workloads, there is DNN beyond that, and DNN are not the only technique for machine learning.
2. The graph shows 2-3x speedup (of convolutions, I suppose) over Intel's MKL-DNN 0.17 on the i5-2500K processor, which is a rather old low-end device. If the format used for convolutions in the test used 8-bit integers for storing image channels (which is possible) this is to be expected, since i5-2500 does not suport AVX-512 integer instructions that are employed there by MKL-DNN. It does not even have AVX-2! If that's the case, just switching to 32-bit float could speed up MKL-DNN almost an order of magnitude. The most informative test would be something run on SkylakeX, since it does support AVX-512...
The speedup is for the whole network, as the graph labels show! The point of the compiler is that you produce code that implements the entire forward pass so you can deploy that code where you need to do the inference.
I agree we need AVX-512 -- hopefully I can get access to a SkylakeX machine in the next few days.
Yes, that's a big part of it. Also, if you want to do something like (for example) keyword spotting in audio on a small device, like a Cortex-M class processor, the constraints are really really difficult to satisfy: most of them have significantly less than 1MB of main memory, for a start! Like this guy, for example: https://openmv.io/products/openmv-cam-m7 -- 512KB of RAM, and it runs at 216MHz. You just can't run tensorflow on something like that; it takes real effort to produce something that can do inference in that context.
With that said, the techniques we've developed here are totally applicable to GPUs as well, and you might even expect that something like algorithmic choice would have an even bigger effect there, if we're just talking about delta-execution-time, but that's future work for us!
Thanks for the response! With respect to the GPU, this reminds me of Tensor Comprehensions. Do you see a fundamental advantage to your approach over their kernel evolution protocol?
Thanks for the question, this really gets to the heart of the issue!
We do see a fundamental advantage, and I'll try to explain what it is.
So TC's kernel evolution is based on autotuning, and what they are autotuning for is the parameters for a specific implementation strategy for convolution (what we would call a convolution algorithm).
There are two conceptual problems with this. The first is that you're constrained by what's written in the algorithm. So for example, you cannot take a direct convolution algorithm and hope the compiler will optimize it into a Winograd convolution. You can only move around within the algorithmic domain. Introducing algorithmic choice is a big conceptual difference.
The second is that we know that this is a problem that can be solved optimally. The autotuner can never tell you if what you have found is the best way of arranging the network, it can only tell you that what you have is the best candidate it's seen so far. But using a real optimization formulation, what comes out is a proof by construction that no better arrangement exists.
The obvious downside is just that the optimization might take longer to run than the autotuning, but what we observe in practice is that it takes the PBQP solver a few seconds to derive the solution from the microbenchmarked cost model, and the cost model takes a few minutes to build for most popular ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.)
Where the TC approach comes in is at the "leaves" of the problem. When you have the optimal algorithmic selection, then you can try to wiggle around trying to find the best way to map that to the hardware, and you might expect percentage return on investment. But the higher-level algorithmic selection gives you factors return on investment.
Ok, my first reaction is that this is that it's really wonderful work - straightforward and with a big payoff at the end.
But this really begs the question: why hasn't this been done before? People have been throwing resources at machine learning for a decade now, and somehow nobody has thought to perform instruction selection before executing a model to optimize the kernels used?
What other low-hanging fruit is out there? Automatic partitioning of networks over several GPUs and CPUs? Such dynamic load balancing algorithms have been available in the HPC literature since there was HPC literature. Fusing multiple primitives to simpler kernels? That's what linear algebra libraries have been doing for decades. Optimizing internal data layout (although that seems to be part of this paper)? Optimizing scheduling decisions to minimize data movement?
---
Also since the author seems to be reading this thread: Have you tried measuring the tree-width of the instruction selection DAGs you generate for the PBQP problem? The heuristics for solving these problems in llvm are applicable to tree-width <= 2, but could be extended to, e.g., tree-width <= 4 without too much slowdown. I wonder if there is still an iota of performance to be gained here. :)
Hi, author here. There is a staggering amount of low hanging fruit. I have been half-seriously blaming GEMM in correspondence. When you have a problem that looks like GEMM, it's such an attractive hammer to pick up that people just don't look beyond it to other techniques!
To answer your other questions: we already have auto load balancing and primitive fusion, albeit rudimentary, but optimizing scheduling is the obvious next step. We've extended this stuff to use ILP, and we're on our way to press at the moment!
Re: tree width: the tree widths are huge, but the solver library we're using handles them :)
how tied are the compiler aspects to the specific kernels (eg for convolutional layers)? Could load balancing and fusion logic be broken out into a library which could work for user defined kernels?
They aren't tied at all -- in fact the optimizer is a totally separate project (triNNity-optimizer) that just does graph optimization. You can add user defined kernels as long as you have some way of microbenchmarking them!
I just realised the implication in your last question is that a heuristic solution to the PBQP problem is obtained. In fact, for a lot of networks, we get the optimal solution :)
DNN DAGs are big, but they are very simple structurally compared to the kind of stuff you see in a real ISA!
That's amazing! Do you have any insight as to which structural properties make this problem simple?
The reason I was asking after tree-width is that control and dataflow graphs in structured programming languages usually have small tree-width (which gets added to the maximum number of overlapping patterns during instruction selection, more or less) and this leads to the fast heuristics used by the PBQP solvers in llvm and libfirm. It's been a few years since I looked into this though, so I'm probably a bit behind the state of the art. :)
Primarily it's down to the fact that most programs have control flow!
As you most likely know, DAG selectors have to tile the CFG (which can have back-edges due to control flow) into a forest of DAGs, which are then fed to the instruction selector. So from a big program, you get a bunch of DAGs. The DAGs are selected separately, which means that when they execute one-after-the-other or in parallel, there may be inter-DAG conflicts on things like issue ports, or memory ports in ALUs. I'm sure there are a billion mitigation strategies for this, but for DNNs the landscape is a little different.
With a DNN you may have parallel paths, but they are typically few, and there are no back-edges. So the entire program is a DAG, and you don't have to tile and select separately!
Is there any plan to make this compile to cuda? I can't imagine bothering with DNNs on an intel cpu in virtually any situation, even if it is 3x faster this way.
Full disclosure: I'm a compiler person who for funding reasons moved into performance of machine learning systems.
None of those things should be called compilers. At best, they are scaffolding for peephole optimization. When you can get these crazy speedups just from bolting on an instruction selector, that's a real indicator that a lot of stuff is just waiting to be done.
For context, MKL-DNN embeds XBYAK, an optimizing JIT targeting SSE4.2, AVX2, and AVX512. It sees all the dimensions of the tensors, it knows the strides of the kernel, and so on and so forth. So for us to be able to beat it by such a margin just by stepping back and doing something simple at a higher level of abstraction kinda indicates that the approach it's using is running up against some conceptual limits. It's not that MKL-DNN's JIT isn't good -- it's great, and it's a credit to the engineers working on it. But the problem is that the smarts are being applied in the wrong place!
cuDNN's autotuner is only making local decisions; If layer X is connected to layer Y is connected to layer Z, simply choosing the fastest algorithm to implement each layer does not guarantee optimality. That's because the different convolution algorithms work best using different data layouts. For example, winograd convolution is usually fastest using NHWC layout, while patch-matrix based approaches like im2col are usually fastest using NCHW layouts.
When two connected layers disagree about the tensor layout, you have to do a data layout transformation, which is expensive. The key to the performance we're getting is that we benchmark those data layout transformations, and we make a global decision about the network, where the optimizer is aware that the cost for selecting algorithms for connected layers is the total cost of algo A + layout transform + algo B, not just the cost of algo A + algo B.
So in a nutshell, the optimizer is making use of global information, while the autotuner is only using local information. If you have a squint at our paper, we actually model what the autotuner is doing; that's the "local optimal" bar on the graphs. We always beat it!
I see. So what do you intend it to become? Are you building a middleware to be inserted between, say, PyTorch and cuDNN? I'm training convnets and rnns written in PyTorch and TF on GPUs, how can I benefit from your work?
You're essentially correct, but there is a bit of a problem with PyTorch and TF specifically, because you don't really have a definition of the model per se. You construct it dynamically using a Python or C++ program.
The Caffe .prototxt format or the ONNX model format are nice declarative specifications for what the model is supposed to do; so those are good input formats for the compiler. I hope more frameworks will prioritize ONNX, because it's really the wild west out here with every framework reinventing the wheel for model specification!
No changes are required in the library -- you just need to have some way of generating the code for the forward pass using e.g. cuDNN (which already has a heuristic selector!)
If I write gemm.cl like above program implementing a bunch of CNN primitives, why I cannot compile it with gcc.
For compiling any Opencl program, I need to write kernel in suppose sample.cl file and compile it with
gcc –Wall sample.cl –o sample -lOpenCL
This -lOpenCL picks up libopencl.so file from my Hardware vendor(Qcom, Intel) and generates the binary which runs it on GPU(or whereever Opencl is available)
Why do I need anything trinity compiler and optimizer.
The short answer is that there are dozens of ways to use that GEMM to do convolution (convolution algorithms), and there are NUM_ALGORITHMS * NUM_LAYERS way to implement the network.
Our toolkit figures out which of those arrangements are the fast ones!
1. This seems to be oriented to convolutions. While convolution is rather important for image-oriented DNN workloads, there is DNN beyond that, and DNN are not the only technique for machine learning.
2. The graph shows 2-3x speedup (of convolutions, I suppose) over Intel's MKL-DNN 0.17 on the i5-2500K processor, which is a rather old low-end device. If the format used for convolutions in the test used 8-bit integers for storing image channels (which is possible) this is to be expected, since i5-2500 does not suport AVX-512 integer instructions that are employed there by MKL-DNN. It does not even have AVX-2! If that's the case, just switching to 32-bit float could speed up MKL-DNN almost an order of magnitude. The most informative test would be something run on SkylakeX, since it does support AVX-512...