by hoosieree
749 days ago
My dissertation worked on a similar problem. I used obfuscation to build a large dataset from a small set of ground-truth functions, then built a model to classify unseen obfuscated binary code to the nearest of the known functions. The application I had in mind during the research was anti-malware static analysis, but optimization is really just the flipside of obfuscation. Something I'd like to try in the future is a diffusion model that treats obfuscation as "noise" to be removed. One thing I learned is that optimizing compilers produce very regular output. After normalizing addresses, the "vocabulary" of basic blocks ends up pretty small, around 2000 tokens. Certain "phrases" correlate with the original source code's semantics regardless of how much obfuscation you add on top.
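To make the "normalize addresses, then count the vocabulary" idea concrete, here is a rough sketch. It is not the code from the dissertation: the function names, the `ADDR` placeholder, and the rule of treating every hex literal as an address are all assumptions made for illustration, and a real pipeline would work from a disassembler's output rather than hand-written strings.

```python
import re
from collections import Counter

# Assumption: any hex literal is treated as an address. A real normalizer
# would distinguish addresses from immediates and offsets.
_HEX_LITERAL = re.compile(r"\b0x[0-9a-f]+\b")
_WHITESPACE = re.compile(r"\s+")


def normalize_instruction(insn: str) -> str:
    """Lowercase one instruction and replace hex literals with ADDR."""
    insn = insn.strip().lower()
    insn = _HEX_LITERAL.sub("ADDR", insn)
    return _WHITESPACE.sub(" ", insn)


def block_token(block: list[str]) -> str:
    """Turn a basic block (list of instructions) into one vocabulary token."""
    return "; ".join(normalize_instruction(insn) for insn in block)


def build_vocabulary(blocks: list[list[str]]) -> Counter:
    """Count how often each normalized basic block appears."""
    return Counter(block_token(block) for block in blocks)


# Hypothetical input: two blocks that differ only in the call target,
# plus one unrelated block.
blocks = [
    ["push rbp", "call 0x401000"],
    ["push rbp", "call 0x402abc"],
    ["xor eax, eax", "ret"],
]
vocab = build_vocabulary(blocks)
print(len(vocab), "distinct tokens")
```

The two `call` blocks collapse into a single token once the target address is masked, which is the effect that keeps the vocabulary small.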