by andrew-wja 2741 days ago
The speedup is for the whole network, as the graph labels show! The point of the compiler is that you produce code that implements the entire forward pass so you can deploy that code where you need to do the inference.

I agree we need AVX-512 -- hopefully I can get access to a Skylake-X machine in the next few days.


Even with AVX-512, CPUs are going to be pretty slow for most networks of decent size. Is the purpose to easily embed small models into other code?
Yes, that's a big part of it. Also, if you want to do something like keyword spotting in audio on a small device, such as a Cortex-M class processor, the constraints are very hard to satisfy: most of them have significantly less than 1MB of main memory, for a start! Take this board, for example: https://openmv.io/products/openmv-cam-m7 -- 512KB of RAM, and it runs at 216MHz. You just can't run TensorFlow on something like that; it takes real effort to produce something that can do inference in that context.

With that said, the techniques we've developed here are totally applicable to GPUs as well. You might even expect algorithmic choice to have a bigger effect there, if we're just talking about the absolute change in execution time, but that's future work for us!

Thanks for the response! With respect to the GPU, this reminds me of Tensor Comprehensions. Do you see a fundamental advantage to your approach over their kernel evolution protocol?
Thanks for the question, this really gets to the heart of the issue!

We do see a fundamental advantage, and I'll try to explain what it is.

TC's kernel evolution is based on autotuning, and what it tunes is the parameters of one specific implementation strategy for convolution (what we would call a convolution algorithm).

There are two conceptual problems with this. The first is that you're constrained by what's written in the algorithm. So for example, you cannot take a direct convolution algorithm and hope the compiler will optimize it into a Winograd convolution. You can only move around within the algorithmic domain. Introducing algorithmic choice is a big conceptual difference.

The second is that we know this problem can be solved optimally. The autotuner can never tell you whether what you have found is the best way of arranging the network; it can only tell you that it is the best candidate it has seen so far. With a real optimization formulation, what comes out is a proof by construction that no better arrangement exists.
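To make the "solved optimally" point concrete, here is a minimal sketch, assuming a network that is a simple chain of layers. It is not the PBQP solver mentioned later in this comment; on a chain, the same kind of problem (a cost per layer for each algorithm, plus a cost on each edge when adjacent layers disagree on data layout) can be solved exactly with dynamic programming. The algorithm names, layouts, and every number in `NODE_COST` and `CONVERT_COST` are invented for illustration.

```python
from itertools import product

# Hypothetical per-layer cost (ms) of each candidate convolution algorithm.
# All numbers are made up for illustration.
NODE_COST = [
    {"direct": 9.0, "im2col": 6.0, "winograd": 4.0},
    {"direct": 7.0, "im2col": 5.0, "winograd": 5.5},
    {"direct": 3.0, "im2col": 4.0, "winograd": 6.0},
]

# Hypothetical data layout each algorithm works in, and a flat cost (ms)
# for converting between two different layouts.
LAYOUT = {"direct": "NCHW", "im2col": "NHWC", "winograd": "tiled"}
CONVERT_COST = 1.5


def edge_cost(prev_alg, next_alg):
    """Cost of handing a tensor from one layer's algorithm to the next."""
    return 0.0 if LAYOUT[prev_alg] == LAYOUT[next_alg] else CONVERT_COST


def total_cost(choice, node_cost):
    """Total cost of one complete assignment of algorithms to layers."""
    cost = sum(layer[alg] for layer, alg in zip(node_cost, choice))
    cost += sum(edge_cost(a, b) for a, b in zip(choice, choice[1:]))
    return cost


def greedy(node_cost):
    """Pick the cheapest algorithm per layer, ignoring conversion costs."""
    choice = [min(layer, key=layer.get) for layer in node_cost]
    return total_cost(choice, node_cost), choice


def solve_chain(node_cost):
    """Exact optimum for a chain of layers via dynamic programming."""
    # best[alg] = (cheapest cost of any prefix ending in alg, that prefix)
    best = {alg: (cost, [alg]) for alg, cost in node_cost[0].items()}
    for layer in node_cost[1:]:
        nxt = {}
        for alg, cost in layer.items():
            prev = min(best, key=lambda p: best[p][0] + edge_cost(p, alg))
            prev_cost, prev_path = best[prev]
            nxt[alg] = (prev_cost + edge_cost(prev, alg) + cost, prev_path + [alg])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


def brute_force(node_cost):
    """Enumerate every assignment; only feasible for tiny examples."""
    choices = product(*(layer.keys() for layer in node_cost))
    return min((total_cost(list(c), node_cost), list(c)) for c in choices)


if __name__ == "__main__":
    print("greedy :", greedy(NODE_COST))
    print("optimal:", solve_chain(NODE_COST))
    print("checked:", brute_force(NODE_COST))
```

With these made-up numbers, picking the fastest algorithm for each layer in isolation loses to the exact solution, because it pays for two layout conversions. The brute-force check is only there to confirm the dynamic program on a toy instance; real networks are not chains, which is where a more general solver comes in.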

The obvious downside is that the optimization might take longer to run than the autotuning. What we observe in practice, though, is that the PBQP (partitioned Boolean quadratic programming) solver takes a few seconds to derive the solution from the microbenchmarked cost model, and the cost model takes a few minutes to build for most popular ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.).
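For a rough idea of what "microbenchmarked cost model" can mean, here is a toy sketch, assuming the model is just a table of measured run times per layer shape and per candidate implementation. The two pure-Python 1D routines, the function names, and the layer shapes are hypothetical stand-ins, not the real convolution algorithms or the real benchmarking harness.

```python
import time


def conv1d_direct(signal, kernel):
    """Direct sliding-window 1D convolution (no kernel flip, as in CNNs)."""
    n, k = len(signal), len(kernel)
    out = []
    for i in range(n - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out.append(acc)
    return out


def conv1d_sliced(signal, kernel):
    """Same result, computed from slices with zip and sum."""
    k = len(kernel)
    return [
        sum(s * w for s, w in zip(signal[i:i + k], kernel))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical candidate set; a real one would hold different algorithms.
CANDIDATES = {"direct": conv1d_direct, "sliced": conv1d_sliced}


def microbenchmark(fn, args, repeats=5):
    """Best-of-N wall-clock time in seconds for one call of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def build_cost_table(candidates, layer_shapes, repeats=5):
    """One dict per layer shape, mapping candidate name to measured time."""
    table = []
    for n, k in layer_shapes:
        signal = [float(i % 7) for i in range(n)]  # synthetic input
        kernel = [1.0] * k
        table.append({
            name: microbenchmark(fn, (signal, kernel), repeats)
            for name, fn in candidates.items()
        })
    return table


if __name__ == "__main__":
    # (signal length, kernel length) pairs, invented for illustration.
    shapes = [(4096, 3), (4096, 11), (1024, 64)]
    for shape, costs in zip(shapes, build_cost_table(CANDIDATES, shapes)):
        print(shape, {name: f"{t * 1e3:.3f} ms" for name, t in costs.items()})
```

The measured times depend on the machine, which is the point: the table is built on the target hardware and then handed to the solver as its cost input.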

Where the TC approach comes in is at the "leaves" of the problem. Once you have the optimal algorithmic selection, you can tune how each selected algorithm maps to the hardware, and you might expect a percentage-level return on that investment. The higher-level algorithmic selection, by contrast, gives you returns measured in whole factors.