by fmap
2745 days ago
Ok, my first reaction is that this is really wonderful work: straightforward and with a big payoff at the end. But it really raises the question: why hasn't this been done before? People have been throwing resources at machine learning for a decade now, and somehow nobody has thought to perform instruction selection before executing a model, to optimize the kernels used?

What other low-hanging fruit is out there? Automatic partitioning of networks over several GPUs and CPUs? Such dynamic load balancing algorithms have been available in the HPC literature since there was HPC literature. Fusing multiple primitives into simpler kernels? That's what linear algebra libraries have been doing for decades. Optimizing internal data layout (although that seems to be part of this paper)? Optimizing scheduling decisions to minimize data movement?

Also, since the author seems to be reading this thread: have you tried measuring the tree-width of the instruction selection DAGs you generate for the PBQP problem? The heuristics for solving these problems in LLVM are applicable to tree-width <= 2, but could be extended to, e.g., tree-width <= 4 without too much slowdown. I wonder if there is still an iota of performance to be gained here. :)
To answer your other questions: we already have auto load balancing and primitive fusion, albeit rudimentary, but optimizing scheduling is the obvious next step. We've extended this stuff to use ILP, and we're on our way to press at the moment!
Re: tree width: the tree widths are huge, but the solver library we're using handles them :)