Hacker News
by a2code 819 days ago
I would devise a somewhat loose metric: assign each binary a percentage for how much of it has been decompiled. 0% means the binary is still entirely assembly, and 100% means the whole binary is now C code. The ideal decompiler would reach 100% for any binary.

My prediction is that this percentage will increase with time. It would be interesting to construct data for this metric.

It is important to define the limitations of using LLMs for this endeavor. I would like to emphasize your subtle point: the compiler used for the original binary may not be the same as the one you use. The probability of a mismatch increases with time, as compilers improve or the platform on which the binary runs becomes obsolete. This is a problem for validation, because you cannot directly compare the original assembly with the assembly you get by recompiling the decompiled C code.

Perhaps each assembly routine could be given a likelihood, that is, how sure the LLM is that some C code maps to that routine. Routines with hand-coded assembly would then have a lower likelihood.

1 comments

Could you expand on how this metric would be practically defined?

The problem isn’t lifting to C code, but rather lifting to “good C code”. For example, you can do a 1-to-1 translation of each assembly instruction to C code that makes the same machine state changes. This is not usually what you want, as it comes with a lot of extra cruft. When people think “decompiler” they think of an output that looks like what they would personally write. But that’s very ill-defined. And, personally, idk how one would define such a thing.
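To make the "extra cruft" point concrete, here is a rough sketch. It is written in Python only to keep it short and runnable; a real lifter would emit C. The three-instruction loop, the register names, and both function names are made up for illustration and do not come from any real binary or tool.

```python
# Hypothetical x86-style loop being "decompiled":
#   loop: add eax, ecx
#         dec ecx
#         jnz loop

MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit values


def lifted_sum(n):
    """Literal 1-to-1 lift: one statement per instruction, explicit machine state."""
    state = {"eax": 0, "ecx": n & MASK32, "zf": False}
    while True:
        state["eax"] = (state["eax"] + state["ecx"]) & MASK32  # add eax, ecx
        state["ecx"] = (state["ecx"] - 1) & MASK32             # dec ecx
        state["zf"] = state["ecx"] == 0                        # dec updates ZF
        if not state["zf"]:                                    # jnz loop
            continue
        break
    return state["eax"]


def readable_sum(n):
    """What a person would more likely write for the same behavior (n >= 1)."""
    return sum(range(1, n + 1)) & MASK32
```

Both functions compute the same result for n >= 1, so a purely behavioral check cannot tell them apart. Only the second is what most people would call a good decompilation, which is exactly the part that is hard to define.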

I am brainstorming here.

In practice, perhaps the metric could be built around a C program that acts as a validation test. The source code of this C program is not publicly available. Only the binary is distributed. Let us name the binary ctestbox.

When ctestbox is run, it creates a multiplicity of new text or binary files. Each of these is like a unit test.

Consider a tool that decompiles a binary. Given ctestbox, this tool should produce C code that compiles to a.out, which, when run, ideally creates identical text or binary files. Now you simply count the number of identical files as a metric.
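The counting step could be sketched roughly like this. It assumes ctestbox and the recompiled a.out have already been run in two separate output directories; the function name and the directory layout are my own invention, not part of any existing tool.

```python
import filecmp
from pathlib import Path


def identical_file_score(reference_dir, candidate_dir):
    """Return (matches, total): how many files written by the original binary
    were reproduced byte-for-byte by the recompiled one."""
    reference_dir = Path(reference_dir)
    candidate_dir = Path(candidate_dir)
    # Every file the original binary produced counts as one "unit test".
    reference_files = sorted(p for p in reference_dir.rglob("*") if p.is_file())
    matches = 0
    for ref in reference_files:
        cand = candidate_dir / ref.relative_to(reference_dir)
        # shallow=False forces a content comparison, not just size/mtime.
        if cand.is_file() and filecmp.cmp(ref, cand, shallow=False):
            matches += 1
    return matches, len(reference_files)
```

Calling `identical_file_score("out_ctestbox", "out_aout")` (hypothetical directory names) gives the raw count plus the total, so dividing the two yields the percentage from the original comment. A missing file simply counts as a failed test.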