Hacker News new | ask | show | jobs
by hoosieree 749 days ago
My dissertation worked on a similar problem. I used obfuscation to build a large dataset from a small set of ground-truth functions. Then built a model to classify unseen obfuscated binary code to the nearest of the known functions.

The application I had in mind during the research was anti-malware static analysis, but optimization is really just the flipside of obfuscation. Something I'd like to try in the future is a diffusion model that treats obfuscation as "noise" to be removed.

One thing I learned is that optimizing compilers produce very regular output. After normalizing addresses, the "vocabulary" size of basic blocks ends up pretty small, like ~2000 tokens. Certain "phrases" correlate with the original source code's semantics regardless of how much obfuscation you add on top.