Thanks for the response! With respect to the GPU, this reminds me of Tensor Comprehensions. Do you see a fundamental advantage to your approach over their kernel evolution protocol?
Thanks for the question, this really gets to the heart of the issue!
We do see a fundamental advantage, and I'll try to explain what it is.
So TC's kernel evolution is based on autotuning, and what they are autotuning for is the parameters for a specific implementation strategy for convolution (what we would call a convolution algorithm).
There are two conceptual problems with this. The first is that you're constrained by what's written in the algorithm. So for example, you cannot take a direct convolution algorithm and hope the compiler will optimize it into a Winograd convolution. You can only move around within the algorithmic domain. Introducing algorithmic choice is a big conceptual difference.
The second is that we know that this is a problem that can be solved optimally. The autotuner can never tell you if what you have found is the best way of arranging the network, it can only tell you that what you have is the best candidate it's seen so far. But using a real optimization formulation, what comes out is a proof by construction that no better arrangement exists.
The obvious downside is just that the optimization might take longer to run than the autotuning, but what we observe in practice is that it takes the PBQP solver a few seconds to derive the solution from the microbenchmarked cost model, and the cost model takes a few minutes to build for most popular ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.)
Where the TC approach comes in is at the "leaves" of the problem. When you have the optimal algorithmic selection, then you can try to wiggle around trying to find the best way to map that to the hardware, and you might expect percentage return on investment. But the higher-level algorithmic selection gives you factors return on investment.
We do see a fundamental advantage, and I'll try to explain what it is.
So TC's kernel evolution is based on autotuning, and what they are autotuning for is the parameters for a specific implementation strategy for convolution (what we would call a convolution algorithm).
There are two conceptual problems with this. The first is that you're constrained by what's written in the algorithm. So for example, you cannot take a direct convolution algorithm and hope the compiler will optimize it into a Winograd convolution. You can only move around within the algorithmic domain. Introducing algorithmic choice is a big conceptual difference.
The second is that we know that this is a problem that can be solved optimally. The autotuner can never tell you if what you have found is the best way of arranging the network, it can only tell you that what you have is the best candidate it's seen so far. But using a real optimization formulation, what comes out is a proof by construction that no better arrangement exists.
The obvious downside is just that the optimization might take longer to run than the autotuning, but what we observe in practice is that it takes the PBQP solver a few seconds to derive the solution from the microbenchmarked cost model, and the cost model takes a few minutes to build for most popular ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.)
Where the TC approach comes in is at the "leaves" of the problem. When you have the optimal algorithmic selection, then you can try to wiggle around trying to find the best way to map that to the hardware, and you might expect percentage return on investment. But the higher-level algorithmic selection gives you factors return on investment.