Hacker News new | ask | show | jobs
by jacb 2137 days ago
Have there been attempts to measure modern compilers' ability to optimize large programs against "perfect" machine code? The few times I've poked through demo scene binaries, I've felt a sense of awe at the complexity described by just a few instructions, but demos aren't starting from a specification and trying to find a compressed representation, they're trying to create small binaries that expand into something complex and interesting. Clearly this problem is NP-hard as hell (and most software doesn't need to be optimized instruction-by-instruction), but it would be incredibly neat to have an estimate of _how much worse_ our compilers might be.

I'm reminded of the old story about someone who tried to evolve an FPGA configuration to perform signal processing, and the final generation exploited imperfections in the chip to use fewer gates.

6 comments

That would indeed be very neat. I think it's also impossible to do in a reasonable way, because how well a compiler does is probably dependent in a huge way on the particular program. There is a huge class of program transformations that are technically correct but far too specific to implement in a general-purpose compiler, and I think how well a compiler does is mostly dependent on how many of those transformations happen to apply to your program at hand.

For example: suppose you have a program that renders a 3D scene from a hardcoded scene description. A compiler might do a lot of stuff to that program, but what it will certainly not do is notice that the whole hardcoded scene can be replaced with a small representation in the form of a distance estimation function, if the rendering algorithm is also changed from raytracing to distance estimation.

A human in a demo scene situation will certainly perform (or at least try to perform) that transformation, if only because distance estimation is well known to often provide compact representations of certain, a priori very complex, scenes.

Also, nitpick:

> Clearly this problem is NP-complete as hell

I think it's almost certainly not even in NP; that would require a polynomial algorithm that, given some evidence that you can choose, proves that a particular program really is optimal. I think that's still impossible (but am gladly proven wrong). If I'm correct, it's not NP-complete, it's NP-hard.

> how well a compiler does is mostly dependent on how many of those transformations happen to apply to your program at hand

Right, there's an infinite number of distinct useful code optimizations, there's a cost to checking if any given optimization can be applied, and some optimizations have _massive_ costs for very rare and specific savings, so any given compiler is making a practical decision to include some optimizations and omit others. There was some discussion in the Rust community about LLVM having a period where they weren't tracking regressions in compile times - so "the perfect optimizing compiler" isn't really a coherent goal. But I still wonder how much faster Optimal Chromium would be. Just another interesting number I'll never know, I suppose.

> I think it's almost certainly not even in NP

Yeah, that was sloppy of me, I think this is undecidable by Rice's theorem. Nice catch!

The field you're looking for is superoptimization. Last I checked most of the work focused on loop-free snippets so that program length could serve as a proxy for other performance metrics. STOKE shows up frequently in those discussions, but people have tried all kinds of things ranging from genetic algorithms to neural networks to SMT solvers.

To actually answer your question, compilers are pretty good, and so are expert human assemblers, but they both miss things even for short snippets.

And the flip side is that STOKE and friends can find new, faster ways to compile code... only for short snippets :).

That said, compilers and superoptimizers can go together well. One of my favorite examples is using program synthesis to generate peephole optimization rules for a compiler[1]. The synthesis step takes a long time and only works for short snippets, but if you can turn the results of that into rules a compiler's optimization system can use, you only have to pay the synthesis cost once and the compiler can apply those rules to speed up large programs.

[1]: http://www.cse.iitd.ac.in/~sbansal/pubs/asplos06.pdf

I think this is the story, it's good fun: https://www.damninteresting.com/on-the-origin-of-circuits/

The evolved circuit used fewer gates, but only worked on the device under test and when they removed gates that were apparently doing nothing, it failed to perform the task. ML cheating at its best!

"cheating" isn't, when it wasn't even taught what the rules were in the first place!

I wonder if there's a similar analogy with how a young kid or someone completely unfamiliar with something can come up with an amazing solution that more experienced people would never think of --- because the latter had "hardened" their thinking to accustomed patterns and didn't go beyond them.

It boils down to various kinds of normalization - as in functional programming's normal form. If you can get a normal form of an expression (or canonical form), you can fuse several seemingly different expressions into one and share its computation in several places.

As an example aside from functional programming, consider instruction set architectures.

There are 3-address RISC instructions and "x := y - z" can be encoded in 32^3 different ways. It is not practical to consider all of them as a separate thing and share them.

The situation is less severe in two address ISAs and even less severe in 1 address (accumulator based) ISA (8086 is a blend between 1 and 2 address ISA).

For zero address ISAs it is even better. There's only one way to compute y-z, by issuing "-", which takes two arguments from stack and places subtraction result into it. Situations that lead to that subtraction are less diverse than in RISC 3-address way. And here comes the gist: one can factor same sequences of opcodes into subroutine ("word" in Forth parlance) and replace these sequences with call to that subroutine.

By programming Forth you do LZ compression on programs.

If you can't compress your program more, you have a perfect Forth program, and I am not quite joking here.

The same applies, albeit not that dramatically and visible, to other things in programs. It is possible to determine algorithmic kernels in programs (by matching optimized data and control flow against a pattern) and replace them with different ones - less computation intensive, parallelized, parametrized, etc. There were such works in optimization literature.

Again, in the kernel matching optimization goes first, because optimization makes program more canonical, so to say. And matching on canonical forms is easier.

"Optimal" compilation is actually harder than NP-complete. Because program equivalence is undecidable, finding a canonical "optimal" representation for a program can't be done in general.
What about superoptimization? For instance, Stoke (http://stoke.stanford.edu/) or Souper (https://github.com/google/souper). They will not find all optimizations of course, and program equivalence is undecidable indeed, but they are a good shot at optimal compilation, I would say.
Just because a problem is undecidable that doesn't mean that you can't in fact solve it for a large set of interesting instances (maybe all instances you care about).
It's decidable provided the problem-space is finite, right?
Every finite problem is decidable in linear time because you can hardcode the solution.
Right, meaning decidability isn't an issue for superoptimisation of many kinds of problems.
It will be hard to find agreement on what is perfect machine code. Smallest binary? Fastest? Least number of instructions? Lowest power consumption? Able to run equally well across more CPU generations? Able to be understood and debugged by more? Using CISC silicon which has more functionality baked in versus an RISC one? etc...etc.