Hacker News new | ask | show | jobs
by fmap 2745 days ago
Ok, my first reaction is that this is that it's really wonderful work - straightforward and with a big payoff at the end.

But this really begs the question: why hasn't this been done before? People have been throwing resources at machine learning for a decade now, and somehow nobody has thought to perform instruction selection before executing a model to optimize the kernels used?

What other low-hanging fruit is out there? Automatic partitioning of networks over several GPUs and CPUs? Such dynamic load balancing algorithms have been available in the HPC literature since there was HPC literature. Fusing multiple primitives to simpler kernels? That's what linear algebra libraries have been doing for decades. Optimizing internal data layout (although that seems to be part of this paper)? Optimizing scheduling decisions to minimize data movement?

---

Also since the author seems to be reading this thread: Have you tried measuring the tree-width of the instruction selection DAGs you generate for the PBQP problem? The heuristics for solving these problems in llvm are applicable to tree-width <= 2, but could be extended to, e.g., tree-width <= 4 without too much slowdown. I wonder if there is still an iota of performance to be gained here. :)

2 comments

Hi, author here. There is a staggering amount of low hanging fruit. I have been half-seriously blaming GEMM in correspondence. When you have a problem that looks like GEMM, it's such an attractive hammer to pick up that people just don't look beyond it to other techniques!

To answer your other questions: we already have auto load balancing and primitive fusion, albeit rudimentary, but optimizing scheduling is the obvious next step. We've extended this stuff to use ILP, and we're on our way to press at the moment!

Re: tree width: the tree widths are huge, but the solver library we're using handles them :)

how tied are the compiler aspects to the specific kernels (eg for convolutional layers)? Could load balancing and fusion logic be broken out into a library which could work for user defined kernels?
They aren't tied at all -- in fact the optimizer is a totally separate project (triNNity-optimizer) that just does graph optimization. You can add user defined kernels as long as you have some way of microbenchmarking them!
I just realised the implication in your last question is that a heuristic solution to the PBQP problem is obtained. In fact, for a lot of networks, we get the optimal solution :)

DNN DAGs are big, but they are very simple structurally compared to the kind of stuff you see in a real ISA!

That's amazing! Do you have any insight as to which structural properties make this problem simple?

The reason I was asking after tree-width is that control and dataflow graphs in structured programming languages usually have small tree-width (which gets added to the maximum number of overlapping patterns during instruction selection, more or less) and this leads to the fast heuristics used by the PBQP solvers in llvm and libfirm. It's been a few years since I looked into this though, so I'm probably a bit behind the state of the art. :)

Primarily it's down to the fact that most programs have control flow!

As you most likely know, DAG selectors have to tile the CFG (which can have back-edges due to control flow) into a forest of DAGs, which are then fed to the instruction selector. So from a big program, you get a bunch of DAGs. The DAGs are selected separately, which means that when they execute one-after-the-other or in parallel, there may be inter-DAG conflicts on things like issue ports, or memory ports in ALUs. I'm sure there are a billion mitigation strategies for this, but for DNNs the landscape is a little different.

With a DNN you may have parallel paths, but they are typically few, and there are no back-edges. So the entire program is a DAG, and you don't have to tile and select separately!