by dragandj 2744 days ago

A few minor notes:

1. This seems to be oriented to convolutions. While convolution is rather important for image-oriented DNN workloads, there is more to DNNs than that, and DNNs are not the only technique for machine learning.

2. The graph shows a 2-3x speedup (of convolutions, I suppose) over Intel's MKL-DNN 0.17 on the i5-2500K processor, which is a rather old low-end device. If the format used for convolutions in the test used 8-bit integers for storing image channels (which is possible), this is to be expected, since the i5-2500K does not support the AVX-512 integer instructions that MKL-DNN employs there. It does not even have AVX2! If that's the case, just switching to 32-bit float could speed up MKL-DNN by almost an order of magnitude. The most informative test would be something run on Skylake-X, since it does support AVX-512...

andrew-wja 2736 days ago

We have some updated numbers, including benchmarking on Kaby Lake (AVX2): https://www.scss.tcd.ie/~andersan/projects/live/triNNity.htm...

I believe I will have access to a Skylake-X machine in the next few days, so hopefully I can post AVX-512 results soon as well.

andrew-wja 2744 days ago

The speedup is for the whole network, as the graph labels show! The point of the compiler is that you produce code that implements the entire forward pass so you can deploy that code where you need to do the inference.

I agree we need AVX-512 -- hopefully I can get access to a SkylakeX machine in the next few days.

stochastic_monk 2744 days ago

Even with AVX512, CPUs are going to be pretty slow for most networks of decent size. Is the purpose to easily embed small models into other code?

andrew-wja 2744 days ago

Yes, that's a big part of it. Also, if you want to do something like (for example) keyword spotting in audio on a small device, like a Cortex-M class processor, the constraints are really really difficult to satisfy: most of them have significantly less than 1MB of main memory, for a start! Like this guy, for example: https://openmv.io/products/openmv-cam-m7 -- 512KB of RAM, and it runs at 216MHz. You just can't run tensorflow on something like that; it takes real effort to produce something that can do inference in that context.
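To put rough numbers on the memory constraint, here is a back-of-the-envelope sketch. The layer shapes are hypothetical (a made-up three-layer CNN with stride 1 and "same" padding, not any real keyword-spotting model), and the estimate counts only weights plus the largest input/output activation pair, ignoring code, stack, and any scratch buffers. Only the 512KB figure comes from the device mentioned above.

```python
RAM_BYTES = 512 * 1024  # 512KB of RAM, as quoted for the board above

# Hypothetical layers: (kernel_h, kernel_w, in_channels, out_channels, height, width).
# Stride 1 and "same" padding are assumed, so input and output share height/width.
LAYERS = [
    (3, 3, 1, 64, 49, 10),
    (3, 3, 64, 64, 49, 10),
    (3, 3, 64, 64, 49, 10),
]

def weight_values(layers):
    """Number of weights plus biases across all layers."""
    return sum(kh * kw * cin * cout + cout for kh, kw, cin, cout, _, _ in layers)

def peak_activation_values(layers):
    """Largest input + output tensor pair that must be live at the same time."""
    return max(h * w * (cin + cout) for _, _, cin, cout, h, w in layers)

def footprint_bytes(layers, bytes_per_value):
    """Rough lower bound on RAM needed for weights and activations."""
    return (weight_values(layers) + peak_activation_values(layers)) * bytes_per_value

def fits(layers, bytes_per_value, ram_bytes=RAM_BYTES):
    return footprint_bytes(layers, bytes_per_value) <= ram_bytes

if __name__ == "__main__":
    for name, width in (("float32", 4), ("int8", 1)):
        size = footprint_bytes(LAYERS, width)
        print(f"{name}: {size} bytes, fits in 512KB: {fits(LAYERS, width)}")
```

Under these assumptions even this toy network overflows 512KB with 32-bit floats and only fits with 8-bit values, before counting anything else that has to live in RAM.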

With that said, the techniques we've developed here are totally applicable to GPUs as well, and you might even expect that something like algorithmic choice would have an even bigger effect there, if we're just talking about delta-execution-time, but that's future work for us!

stochastic_monk 2742 days ago

Thanks for the response! With respect to the GPU, this reminds me of Tensor Comprehensions. Do you see a fundamental advantage to your approach over their kernel evolution protocol?

andrew-wja 2741 days ago

Thanks for the question, this really gets to the heart of the issue!

We do see a fundamental advantage, and I'll try to explain what it is.

So TC's kernel evolution is based on autotuning, and what they are autotuning for is the parameters for a specific implementation strategy for convolution (what we would call a convolution algorithm).

There are two conceptual problems with this. The first is that you're constrained by what's written in the algorithm. So for example, you cannot take a direct convolution algorithm and hope the compiler will optimize it into a Winograd convolution. You can only move around within the algorithmic domain. Introducing algorithmic choice is a big conceptual difference.
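As an illustration of why this is a different algorithm rather than a retuned one, here is a minimal sketch of the textbook 1D Winograd minimal filtering identity F(2,3) next to a direct convolution. It is not the code generated by the compiler discussed here, and the function names are made up for the example. Both compute two outputs of a 3-tap filter over four inputs; the direct form uses 6 multiplications, the Winograd form uses 4 once the filter transform is precomputed.

```python
def direct_conv_1d(d, g):
    """Direct form: 2 outputs from 4 inputs and a 3-tap filter (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

def winograd_f23(d, g):
    """Winograd F(2,3): same 2 outputs using 4 multiplications."""
    # Filter transform: depends only on g, so it can be computed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]

if __name__ == "__main__":
    d = [0.5, -1.0, 2.0, 3.0]
    g = [2.0, 4.0, 6.0]
    print(direct_conv_1d(d, g), winograd_f23(d, g))
```

No amount of tuning tile sizes or loop orders in `direct_conv_1d` turns it into `winograd_f23`; the arithmetic itself is restructured.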

The second is that we know that this is a problem that can be solved optimally. The autotuner can never tell you if what you have found is the best way of arranging the network; it can only tell you that what you have is the best candidate it's seen so far. But using a real optimization formulation, what comes out is a proof by construction that no better arrangement exists.

The obvious downside is just that the optimization might take longer to run than the autotuning, but what we observe in practice is that it takes the PBQP solver a few seconds to derive the solution from the microbenchmarked cost model, and the cost model takes a few minutes to build for most popular ImageNet networks (VGG-D, MobileNet, SqueezeNet, etc.).
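To give a feel for the shape of the problem, here is a toy sketch of algorithm selection with per-layer costs and pairwise transition costs. It is not a PBQP solver: it handles only a straight chain of layers, where exact dynamic programming is enough, whereas PBQP covers general graphs. All cost numbers are invented for illustration and stand in for a microbenchmarked cost model. A brute-force enumeration is included to check that the selected arrangement really is the cheapest one.

```python
from itertools import product

ALGORITHMS = ["direct", "im2col", "winograd"]

# NODE_COSTS[i][a]: made-up cost of running layer i with algorithm a.
NODE_COSTS = [
    [9.0, 6.0, 4.0],
    [8.0, 5.0, 7.0],
    [3.0, 4.0, 2.5],
]
# EDGE_COSTS[a][b]: made-up cost of converting the data layout produced by
# algorithm a into the layout expected by algorithm b in the next layer.
EDGE_COSTS = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

def total_cost(choice, node_costs=NODE_COSTS, edge_costs=EDGE_COSTS):
    """Cost of one complete assignment of algorithms to layers."""
    cost = sum(node_costs[i][a] for i, a in enumerate(choice))
    cost += sum(edge_costs[a][b] for a, b in zip(choice, choice[1:]))
    return cost

def select_exact(node_costs=NODE_COSTS, edge_costs=EDGE_COSTS):
    """Exact selection for a chain of layers via dynamic programming."""
    k = len(edge_costs)
    best = list(node_costs[0])  # best[a]: cheapest cost ending in algorithm a
    back = []                   # back-pointers for recovering the choice
    for layer in node_costs[1:]:
        new_best, pointers = [], []
        for b in range(k):
            prev = min(range(k), key=lambda a: best[a] + edge_costs[a][b])
            new_best.append(best[prev] + edge_costs[prev][b] + layer[b])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    last = min(range(k), key=lambda a: best[a])
    choice = [last]
    for pointers in reversed(back):
        choice.append(pointers[choice[-1]])
    choice.reverse()
    return best[last], choice

def select_brute_force(node_costs=NODE_COSTS, edge_costs=EDGE_COSTS):
    """Enumerate every assignment; only feasible for tiny examples."""
    k = len(edge_costs)
    return min(total_cost(c, node_costs, edge_costs)
               for c in product(range(k), repeat=len(node_costs)))

if __name__ == "__main__":
    cost, choice = select_exact()
    print(cost, [ALGORITHMS[a] for a in choice])
```

With these made-up numbers, picking the fastest algorithm for each layer in isolation is not the cheapest arrangement overall, because the layout conversions between layers cost more than they save.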

Where the TC approach comes in is at the "leaves" of the problem. Once you have the optimal algorithmic selection, you can wiggle around trying to find the best way to map that to the hardware, and you might expect a return on investment measured in percentage points. But the higher-level algorithmic selection gives you a return measured in whole factors.
