Hacker News new | ask | show | jobs
by can16358p 750 days ago
I wonder how this would apply to compiler/interpreter optimizations.

Is it possible that it can "disect" some parts of the execution, perhaps at assembly level, and come up with optimizations specific to the compiled code without changing the output (I mean expected program output, not emitted binary), that modern compilers have not deterministically come up with?

3 comments

I expect the answer is "no". I wouldn't expect a tool like this to "discover" assembly without being trained on the compiled output. The model has no notion of how or where the code runs.

After decades of compiler research and super compilers chugging away, we're sort of at a point where discovering novel optimizations with results that are more than a smidge of improvement is almost impossibly unlikely. Compilers today are really good.

That said, I think the value that something like this might have is being able to optimize the intent of the code. If it can determine that I'm sorting some numbers, it can rewrite my code to use a faster sorting algorithm that has the same functional properties. If I'm storing data that never gets used, it can stop storing it. It has a view of the code at a level above what the compiler sees, with an understanding not just of what is being done, but why.

> After decades of compiler research and super compilers chugging away, we're sort of at a point where discovering novel optimizations with results that are more than a smidge of improvement is almost impossibly unlikely. Compilers today are really good.

I agree when it comes to peephole optimizations, but there's still a lot of juice left in exploiting language guarantees (immutability, non-aliasing, data-parallelism), however most compiler developer energy is spent propping up C/C++ and consequently optimizations are developed with those languages in mind.

My dissertation worked on a similar problem. I used obfuscation to build a large dataset from a small set of ground-truth functions. Then built a model to classify unseen obfuscated binary code to the nearest of the known functions.

The application I had in mind during the research was anti-malware static analysis, but optimization is really just the flipside of obfuscation. Something I'd like to try in the future is a diffusion model that treats obfuscation as "noise" to be removed.

One thing I learned is that optimizing compilers produce very regular output. After normalizing addresses, the "vocabulary" size of basic blocks ends up pretty small, like ~2000 tokens. Certain "phrases" correlate with the original source code's semantics regardless of how much obfuscation you add on top.

This is called superoptimization: https://en.wikipedia.org/wiki/Superoptimization

There are people applying synthesis techniques to superoptimization. So something like this would possibly apply.